FOREWORD


This publication is part of a larger program on criterion-referenced,
performance-oriented evaluation being conducted by the U.S. Army
Research Institute for the Behavioral & Social Sciences (ARI). A major
goal of the program has been to develop procedures for applying CRT theory
to a variety of training situations, including crew and tactical training.

This report summarizes an analysis of the state of the art in
criterion-referenced testing, which preceded the preparation of a test
construction handbook ("Guidebook for Developing Criterion-Referenced
Tests," ARI Report P-75-1, August 1975). Related efforts have included
scoring procedures for performance-based training in tank gunnery (IDOC)
and experiments to compare the accuracy of several CRT models in fitting
empirical data (METTEST).

ARI research in this area is conducted as an in-house effort augmented
by contracts with organizations selected as having unique
capabilities and facilities for research in a specific area. The present
study was conducted by personnel of the Army Research Institute and
Applied Sciences Associates, Inc., under Contract Number DAHC-19-74-C-0018,
and was responsive to the requirements of RDTE Project
2Q164715A757, Training Systems Applications.




BRIEF 


Requirement: 

To analyze the current state of the art in criterion-referenced testing, and
to establish positions on various test construction issues.

Procedure:

A review and analysis of the literature related to criterion-referenced
testing was undertaken. This review included military field manuals,
technical reports and personal communications, as well as professional journals.
Major topics included definition and use of CRTs, reliability/validity, and
test construction.

Findings: 

The  findings  consisted  of  a  set  of  position  statements  under  four  major 
headings: 

1.  Design  considerations  and  CRT  use 

2.  Construction  methodology  and  related  issues 

3.  CRT  administration  and  scoring 

4.  Reliability and validity

Utilization of Findings:

The findings of this study provided a major basis for the preparation
of ARI Report P-75-1, "Guidebook for Developing Criterion-Referenced Tests."
Abstract

A review of the technical and theoretical literature in the area of
Criterion-Referenced Testing (CR Testing) is presented. A number of areas in CRT
development and application are considered. Discussed, in turn, are questions
of CRT reliability and validity in both practical and theoretical areas.
Different methods of CRT construction are reviewed, as is the question of
simulation fidelity (i.e., the extent to which CRTs can and should mirror
real-world performance conditions). Discussion is directed to the use of
CRTs in mastery learning contexts and to test materials development and
item sampling. Diagnostic uses of CRTs and the establishment of cut-off
scores are considered. Uses of CRTs in public education and military contexts
are reviewed. Finally, a position is set forth on general and theoretical
aspects of CRT construction and use.
Introduction

The distinction between norm-referenced measurement (NRM) and
criterion-referenced measurement (CRM) has been aptly illustrated by Popham and Husek
(1969) using the analogy of a dog owner who wants to keep his dog in the
back yard. The owner finds out how high the dog can jump (a
criterion-referenced test) and builds a fence high enough to keep the dog in the back
yard. How high the dog can jump compared to other dogs (a norm-referenced
test) is irrelevant.

Beginning with Glaser (1963), a number of researchers have made similar
distinctions. Folley (1967), for example, has discussed this distinction
in the areas of predictive testing and achievement testing. In the case
of predictive testing the standard is relative; the results attempt to show
how any single individual compares with all other individuals who have taken
the test. In achievement testing, however, the standard is absolute. The
results attempt to show the extent to which an individual has learned a
specific set of behaviors. Discrimination among individuals is of secondary
importance.

Glaser and Nitko (1971, p. 653) have defined a criterion-referenced
test (CRT) as "one that is deliberately constructed to yield measurements
that are directly interpretable in terms of specified performance standards",
a definition which has been slightly expanded by Livingston (1972, p. 13):
"criterion-referenced [is] used to refer to any test for which a criterion
score is specified without reference to the distribution of scores of a group
of examinees." Common to all definitions is the notion that a well-defined
content domain and the development of procedures for generating appropriate


Contemporary  Views 



samples of test items are important. Lyon (1972) argues for the use of CRM
as a vital part of training quality control:

.  .  .  quality  control  requires  absolute  rather  than  relative 
criteria.  Scores  and  grades  must  reflect  how  many  course 
objectives  have  been  mastered  rather  than  how  a  student  compares 
with  other  students. 

Carver (1974) has classified tests as primarily "psychometric", if they
focus on individual differences, or primarily "edumetric", if they are designed
for sensitivity to within-individual gains. Psychometrically designed tests,
in Carver's view, may not be suitable for measuring individual gains, even
though they are often used for that purpose. Carver's classification can be
applied to the CRT-NRT distinction: Generally, NRTs are psychometric tests,
while CRTs are predominantly edumetric.

For the purposes of this review, a CRT will be defined as a test from
which the score of an individual is interpreted against an external standard
(i.e., a standard other than the distribution of scores of other examinees).
Further, CRTs are tests whose items are operational definitions of behavioral
objectives.

The literature of psychology and education contains reference to no more
persistent problem than that of criterion specification (Ronan & Prien, 1973).
Still, no generally accepted method for selecting relevant, reliable, and
practical criterion measures exists today.




Use

The contemporary interest in mastery learning has led to a growing
interest in the use of CRM. CRTs can serve at least two purposes:

1.  They can be used to provide specific information about the
performance levels of individuals on instructional objectives.
This information can be used to support a decision about "mastery"
of a particular objective (Block, 1971).

2.  They can be used to evaluate the effectiveness of instruction.
NRTs given at the end of a course are less useful for making
evaluative decisions about instructional effectiveness, since
they are not derived from particular task objectives. CRTs are,
however, useful for this purpose because of the specificity of
the results to the task objectives (Lord, 1962; Cronbach, 1963;
Shoemaker, 1970, 1970b; Hambleton, Rosinelli and Garth, 1971).

Popham (1973) has pointed out a basic concern with the instrument itself:

We have not yet made an acceptable effort to delineate the defining
dimensions of performance tests, in terms of their content,
objectives, post-test nature, background information, level, etc. Almost
all of the recently developed performance tests have been devised
more or less on the basis of experience and instruction.

Ebel (1971) has posed a series of arguments against the use of CRM in
education. Ebel points out, with some justification, that CRTs do not tell us
all we need to know about educational achievement, noting that they
are not efficient at discovering relative strengths and deficiencies.

Ebel appears to confuse the concept of mastery of material with the practice
of using percentile grades as pass-fail measures, and does not address the
notion that CRTs as currently constructed are the result of the application
of a carefully thought out analysis and development system.

Klein  and  Kosecoff  (1973)  have  provided  a  useful  figure  summarizing 
uses  of  CRTs.  They  classified  CRT  uses  as  to  planning,  types  of  decision, 
and  research;  and  as  to  individual,  group,  or  program  evaluation  purposes. 

Reliability  and  Validity 

As Glaser and Nitko (1971) point out, the appropriate technique for an
empirical estimation of CRT reliability is unclear. Popham and Husek (1969)
suggest the traditional NRT estimates of internal consistency and stability
are not often appropriate because of their dependency on total test score
variability. CRTs typically are interpreted in an absolute fashion; hence,
variability is drastically reduced. This section will critically examine
a number of studies which have addressed the question of reliability. The
question of validity is inextricably mingled with the reliability issue and
also presents many facets of opinion and theory. Various positions concerning
reliability and validity will be discussed in turn.

Reliability 

Smith (1965) has proposed that reliability of test results be assessed
by the range of variation of test results:

$$\pm 1.96 \sqrt{\frac{pq}{N}}$$

where:  p = proportion passing,
q = proportion failing,
N = number taking test.

Smith suggested that this statistic specifies the range within which 95%
of classes would fall by chance variation. Unfortunately, Smith's range of
variation statistic is of limited value, since, for it to be statistically
valid, all classes would have to consist of the same number of students.
Further, as Smith noted, the range only holds when ". . . the next class is
of comparable aptitude, and. . . no significant change, for better or for
worse, has been made in the instruction. . ."
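
As an illustration only, the range of variation can be computed as below. The sketch assumes the statistic is the usual normal-approximation interval around a pass proportion, p plus or minus 1.96 times the square root of pq/N; the function name and the class figures (40 of 50 passing) are hypothetical.

```python
import math


def range_of_variation(n_passed, n_tested, z=1.96):
    """Return (lower, upper) bounds on the proportion passing.

    Assumes the interval p +/- z * sqrt(p * q / N), with q = 1 - p.
    Bounds are clipped to the 0-1 range of a proportion.
    """
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    p = n_passed / n_tested
    q = 1.0 - p
    half_width = z * math.sqrt(p * q / n_tested)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical class: 40 of 50 students pass.
low, high = range_of_variation(n_passed=40, n_tested=50)
print(f"{low:.3f} to {high:.3f}")  # prints "0.689 to 0.911"
```

Because N appears in the denominator, the width of the range changes with class size, which is the limitation noted above when classes differ in size.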

Cox and Vargas (1966) compared the results obtained from two item
analysis procedures using both pre-test and post-test scores; a Difference
Index (DI) was obtained in two ways. A post-test minus pre-test DI was
obtained by subtracting the percentage of students who passed an item on
the pre-test from the percentage who passed that item on the post-test.
A DI was also obtained for each item in the more conventional manner: The
distribution of scores on the post-test was divided into thirds and the
percentage of students in the lower third on the overall test who passed
the item was subtracted from the percentage of students in the upper third
who passed the same item. The Spearman Rhos obtained between the two DIs
were of a moderate order. The authors concluded that their DI differed
sufficiently from the traditional method to warrant its use with CRTs.
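
The two indices described above can be sketched as follows. This is a minimal illustration rather than Cox and Vargas's own procedure: the function names and the nine-student data set are hypothetical, and the split into thirds simply drops any remainder when the group size is not divisible by three.

```python
def pass_rate(responses):
    """Percentage of 1s (passes) in a list of 0/1 item scores."""
    return 100.0 * sum(responses) / len(responses)


def pre_post_di(pre_item, post_item):
    """Post-test minus pre-test Difference Index for one item."""
    return pass_rate(post_item) - pass_rate(pre_item)


def upper_lower_di(post_item, post_totals):
    """Conventional DI: upper-third pass rate minus lower-third pass rate.

    Students are ranked by total post-test score; the thirds are taken
    from the two ends of that ranking.
    """
    order = sorted(range(len(post_totals)), key=lambda i: post_totals[i])
    third = len(order) // 3
    if third == 0:
        raise ValueError("need at least three students")
    lower = [post_item[i] for i in order[:third]]
    upper = [post_item[i] for i in order[-third:]]
    return pass_rate(upper) - pass_rate(lower)


# Hypothetical data for one item and nine students (1 = pass, 0 = fail).
pre_item = [0, 0, 0, 1, 0, 0, 1, 0, 0]
post_item = [1, 1, 0, 1, 1, 1, 1, 0, 1]
post_totals = [8, 9, 3, 7, 6, 9, 10, 4, 5]

print(round(pre_post_di(pre_item, post_item), 1))       # 55.6
print(round(upper_lower_di(post_item, post_totals), 1))  # 66.7
```

The first index depends on two administrations of the test, while the second depends only on the spread of post-test totals.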

Hambleton and Garth (1971) replicated the work of Cox and Vargas (1966)
and found that the choice of statistic does indeed have a significant effect
on the selection of test items. The change in item difficulty from pre-
to post-test seems particularly attractive where two test administrations
are possible. Unfortunately, this method uses statistical procedures dependent
on score variability which are questionable for CRM (Popham and Husek, 1969;
Randall, 1972), particularly if the method is to be employed for item selection
(Oakland, 1972).

Livingston (1972a) acknowledges Popham and Husek's (1969) comment that
"the typical indices of internal consistency are not appropriate for
criterion-referenced tests". Nevertheless, Livingston (1971, 1972a, 1972c) has suggested
that  the  classical  theory  of  true  and  error  scores  can  be  used  to  determine 
CRT  reliability.  Livingston  (1972a,  1972c)  points  out  that  "when  we  use 
criterion-referenced  measures  we  want  to  know  how  far.  .  .[a]  score  deviates 
from  a  fixed  standard."  In  Livingston's  model,  "...  each  concept  based 
on  deviations  from  a  mean  score.  .  .  [is]  replaced  by  a  corresponding  concept 
based  on  deviations  from  the  criterion  score."  In  this  view,  "...  criterion- 
referenced  reliability  can  be  interpreted  as  a  ratio  of  mean  squared  deviations 
from  the  criterion  score." 

Livingston  cites  Lord  and  Novick's  (1968)  definition  of  norm-referenced 
test  reliability  as  the  squared  correlation  between  observed  score  and  true 
score.  Based  on  this  definition,  Livingston  defines  criterion-referenced 
reliability  as  "the  squared  criterion-referenced  correlation  between  observed 
score  and  true  score".  Using  algebraic  proofs,  Livingston  demonstrates  that 
this  criterion-referenced  correlation  equals  the  ratio  of  mean  squared 
deviations of true scores from the criterion score to the mean squared
deviation of observed scores from the criterion score. This ratio is, of course,
predicated on the assumption that one can replace variance (a concept
based on differences from the mean) with the mean squared deviation of scores
from the criterion score. If this view is accepted, a number of useful
relationships are provided; for instance, the further a mean score is from the
criterion score, the greater the criterion-referenced reliability of the test
for that particular group. In effect, moving the mean score away from the
criterion score has the same effect on criterion-referenced reliability that
increasing the variance of true scores has on norm-referenced reliability.

In other words, errors of misclassification of the false positive variety
can be minimized by accepting as true masters the group that comfortably
exceeds the required criterion level.

According to Livingston, ". . . the farther any person's obtained score
is from the criterion score, the more confident we can be in saying that his
true score is on the same side of the criterion score". That is, errors
of measurement will less likely cause misclassification when the observed
score is comfortably distant from the criterion score.
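
The ratio of mean squared deviations described above can be illustrated numerically. The sketch below assumes the standard decomposition of a mean squared deviation from a fixed point C into a variance plus the squared distance of the mean from C, and takes true-score variance as the norm-referenced reliability times the observed variance. The function name and all figures are hypothetical.

```python
def criterion_referenced_reliability(nr_reliability, observed_variance,
                                     mean_score, criterion):
    """Ratio of mean squared deviations from the criterion score.

    numerator   = true-score variance     + (mean - criterion)**2
    denominator = observed-score variance + (mean - criterion)**2
    """
    offset_sq = (mean_score - criterion) ** 2
    true_variance = nr_reliability * observed_variance
    return (true_variance + offset_sq) / (observed_variance + offset_sq)


# Hypothetical test: norm-referenced reliability .60, observed variance 25,
# criterion score 80. Only the group mean changes from row to row.
for mean in (80, 75, 70, 60):
    value = criterion_referenced_reliability(0.60, 25.0, mean, 80)
    print(mean, round(value, 3))
```

When the group mean equals the criterion score the coefficient reduces to the norm-referenced value (.60 here); it rises to .80, .92 and about .976 as the mean moves 5, 10 and 20 points away, even though the error variance is unchanged.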

Another point is that if we accept Livingston's model, then the
criterion-referenced correlation between two tests depends on the difficulty level
of the tests for the particular group involved. Two tests can have a high
correlation only if each is of similar difficulty for a group of examinees.
This limits the computation of inter-item correlations, as it is often
difficult to ensure equal difficulty levels. Livingston's (1971, 1972c)
paper included formulas for criterion-referenced applications of the
Spearman-Brown formula, coefficient alpha, and correction for attenuation,
as well as the derivation of his basic reliability formula.

Regarding Livingston's (1972a) proposal that the psychometric theory of
true and error scores could be adapted to CRM, Oakland (1972) commented that
the procedures seemed viable but that the conditions under which they could
be used (i.e., availability of suitable NRT measures of criterion behaviors,
and multi-item rather than single-item CRTs) were overly restrictive.

Harris (1972) objected to Livingston's (1972a) application of classical
psychometric theory to CRT, pointing out that whether Livingston's
coefficient or a traditional one is applied, the standard error of measurement
remains the same. The fact that Livingston's coefficient is usually the
larger does not mean a more dependable determination of whether or not a
true score falls above or below the criterion score. As a rebuttal, Livingston
(1972b) indicated that Harris had overlooked the point that reliability is
not a property of a single score but of a group of scores. Livingston also
indicated that the larger criterion-referenced reliability does imply a more
dependable overall determination, when this decision is to be made for all
individual scores in the distribution.

Meredith and Sabers (1972) also took issue with Livingston's concept
of CRT reliability estimation as variability around the criterion score,
pointing out that CRM is concerned primarily with the accuracy of the
pass-fail decision and is relatively unconcerned with various levels of
attainment above or below the criterion level.

Roudabush and Green (1972) presented several methods for arriving at
reliability estimates for CRTs. The first involves ordering items
hierarchically according to increasing difficulty. Roudabush and Green proposed
that error of measurement is demonstrated if a student fails an easier item
while passing a series of more difficult items. Oakland (1972) pointed out
that it is very difficult to establish the needed hierarchical order. This
objection has been raised since Guttman (1944) first proposed the technique
of hierarchical ordering. Roudabush and Green's second technique used
point-biserial correlations between parallel tests. Their results with this method
were far from encouraging (reported correlations were fairly low) and, in
addition, there is difficulty inherent in the development of parallel tests.
Their third method involved the use of regression equations to predict item
criterion scores, but has not yet been fully explored.

Hambleton and Novick (1971) proposed regarding CRT reliability as the
consistency of decision-making across parallel forms of the CRT or across
repeated measures. They view validity as the accuracy of decision-making.
This view departs from the classic psychometric view of reliability and
validity. Hambleton and Novick view a decision-theoretic metric such as a
"loss function" as being especially appropriate for use on CRTs. This metric
serves to describe whether an individual's true score is above or below a cutting
score. The concept differs markedly from Livingston's (1972a) notion of
regarding the criterion as the true score.

Swezey and Pearlstein (1974) suggested a simple technique for establishing
the test-retest reliability of CRTs. The same group (of at least 30 people)
is tested twice, close together in time. A four-fold table (first
administration versus second administration, pass versus fail) is then created,
and a $\phi$ (phi) coefficient computed. It is important that the test group
is not aware that it will be retested, so that practice does not occur between
administrations. Swezey and Pearlstein proposed that a $\phi$ value of less
than +.50 be considered indicative of questionable test reliability.
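
A minimal sketch of the computation follows, assuming the standard formula for the phi coefficient of a 2 x 2 table, (ad - bc) divided by the square root of the product of the four marginal totals. The function name and the cell counts for a group of 30 are hypothetical; the +.50 threshold is the one attributed to Swezey and Pearlstein above.

```python
import math


def phi_coefficient(pass_pass, pass_fail, fail_pass, fail_fail):
    """Phi for a four-fold table of first by second administration.

    pass_fail means passed the first administration and failed the second.
    """
    a, b, c, d = pass_pass, pass_fail, fail_pass, fail_fail
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column is empty")
    return (a * d - b * c) / denominator


# Hypothetical group of 30 examinees tested twice.
phi = phi_coefficient(pass_pass=18, pass_fail=3, fail_pass=2, fail_fail=7)
print(round(phi, 3))  # 0.617
print("questionable" if phi < 0.50 else "acceptable")  # acceptable
```

Note that phi is undefined when everyone passes (or everyone fails) one of the administrations, a case that is not unusual after mastery-oriented instruction.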

The importance of correct decision-making in CRT applications is also
recognized by Edmonston, Randall, and Oakland (1972), who presented a CRT
reliability model aimed at supporting decisions made during formative
evaluation and at maximizing the probability of learning an established set
of objectives. Criterion-referenced items are often binarily coded
pass-fail; therefore, summaries of group performance on two items of pre- and
post-test can be displayed in a 2 x 2 contingency table. Edmonston et al.
recommended utilizing cell proportions to provide information about the
relationships between variables represented in the table. They found that
a simple summation of the diagonal proportions, $\sum_a p_{aa}$, provides a
very useful measure of agreement between categories, where a indexes the
cells of the matrix in which both classifications are the same (pass-pass
or fail-fail).

Thus, $\sum_a p_{aa}$ is used as a measure of association of cells in 2 x 2
tables, as opposed to chi-square, which is used as a measure of independence.
The coefficient of agreement is computed from cell values, rather than from
marginal values, and should be used, according to Goodman and Kruskal (1954),
for cases "in which the classes are the same for two polytomies. . . but
differ in that assignment to class depends on which of two methods of
assignment is used". Two test items, both of which are scored on a pass-fail basis,
satisfy Goodman and Kruskal's conditions for the use of the coefficient of
agreement.

They also recommended a supplemental measure, $\lambda_r$ (lambda sub r), a
variance-free coefficient. Goodman and Kruskal (1954) define $\lambda_r$ as:

$$\lambda_r = \frac{\sum_a p_{aa} - \tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}{1 - \tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}$$

where $p_{M\cdot}$ and $p_{\cdot M}$ are the modal class frequencies for each of the two
cross-classifications. $\lambda_r$ may be interpreted as the relative reduction in
the probability of error of classification when going from a no-information
situation to an other-method-known situation. The no-information situation
refers to the probability of correctly classifying an individual randomly
selected from a population on a dichotomous variable (e.g., able to perform
item X, or not able to perform item X) when no information is known. The
other-method-known situation refers to the probability of correct
classification on item X when the classification from item Y is known, or vice versa.

$\lambda_r$ can take values from -1 to +1. A value of +1 indicates that both
measures yield the same classification in all cases. A value of -1 indicates
that the two measures never agree (no one who passes item 1 passes item 2,
no one who fails item 1 fails item 2) and that the two modal frequencies of
classification sum to unity. A difficulty with using $\lambda_r$ is that if two
measures are independent, $\lambda_r$ has no set value (i.e., its value is not
fixed at 0).
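
Both coefficients can be computed from a table of counts, as in the sketch below. It takes lambda sub r as the diagonal sum minus half the sum of the two modal marginal proportions, divided by one minus that same half-sum. One point is an assumption of this sketch rather than something stated above: the modal class is taken to be the single category whose row and column marginal proportions have the largest sum. The function name and the counts are hypothetical.

```python
def agreement_and_lambda_r(table):
    """Return (sum of diagonal proportions, lambda_r) for a square table.

    table[i][j] is the number of examinees placed in category i by the
    first measure and in category j by the second measure.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table is empty")
    p = [[cell / n for cell in row] for row in table]
    agreement = sum(p[a][a] for a in range(k))
    row_marginals = [sum(p[a]) for a in range(k)]
    col_marginals = [sum(p[i][a] for i in range(k)) for a in range(k)]
    # Assumption: one modal class, chosen by the largest combined marginal.
    half_modal = 0.5 * max(row_marginals[a] + col_marginals[a] for a in range(k))
    if half_modal >= 1.0:
        raise ValueError("lambda_r is undefined when all cases fall in one class")
    return agreement, (agreement - half_modal) / (1.0 - half_modal)


# Hypothetical item pair: rows are item 1 (pass, fail), columns are item 2.
agreement, lam = agreement_and_lambda_r([[18, 3], [2, 7]])
print(round(agreement, 3), round(lam, 3))  # 0.833 0.474
```

With these counts the two items agree for 83.3% of examinees, yet lambda sub r is only about .47, because most of that agreement could be obtained by assigning everyone to the modal (pass) class.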

Edmonston et al. feel the reliability estimate most useful to CRM is
the extent of temporal fluctuation. They suggested that, minimally, CRT items
should provide stable estimates of knowledge of curriculum content;
$\sum_a p_{aa}$ and $\lambda_r$ can be used to provide estimates of this stability.
They recommend that $\sum_a p_{aa}$ be used to judge the re-test reliability of
each item. However, when item re-test reliability falls below an arbitrary
criterion (Edmonston et al. recommend 89%) and into a zone of decision,
$\lambda_r$ is employed as a descriptive measure of the amount of information
gained by employing a second item (the re-test) in making curriculum or
placement decisions. The method for making such decisions is not clear.
Edmonston et al. stated only that "if knowledge of the retest score provides
additional information as to how students can be classified, the item is
retained."

In the same vein as Edmonston et al., Roudabush (1973) described reliability
as the appropriateness of decisions affecting the treatment of examinees.
Roudabush emphasized "minimizing risk or cost to examinees." The decision
is whether to discontinue instruction, remediate, or wash out.

Validity 

As is the case with NRT development, determination of validity for CRTs
has seen less investigation than reliability. However, it is generally
agreed that content validity is a paramount concern in CRT development.
According to Popham and Husek (1969), content validity is determined by "a
carefully made judgment, based on the test's apparent relevance to the
behaviors legitimately inferable from those delimited by the criterion."

Klein and Kosecoff (1973) offered a distinction among the three most
common methods of establishing content validity. In "systematic test
development" the rationale of the systematic test development procedure used is
explained, indicating why it should yield a content valid test. In the
method of "expert judgment", content experts are given objectives and items,
and are then asked to match items to objectives. The more accurately they
can do this, the higher the content validity of the CRT using these items
to measure the objectives. The third method uses "item analysis" to assess
internal consistency "and/or see whether an item on a given objective correlates
more highly with other items for this objective than it does with items on
other objectives."

Klein and Kosecoff pointed out, however, that all three methods are
limited by dangers involved with internal consistency techniques applied to
CRM, and by possible lack of score variance. They noted, though, that lack
of variance is a problem that "usually appears to be more theoretical than
actual", and that "if enough students are tested, then one will discover
sufficient variance in the levels of performance and/or in the time it takes
to achieve a given level".

McFann (1973) viewed the content validation of training as having two
major dimensions. The first is the role of the human within a general
operating system. Generally, this is defined by means of task analysis.
The second dimension involves skills and knowledges a trainee brings with
him to the course; training content can then be viewed as a residual of
what must still be imparted to the trainee. The decision of what to include
in training must also be tempered by management orientation to cost and
effectiveness.

McFann stated that "decisions made on the units or procedures by which
output (course completions) are to be evaluated, has an influence on
validation of training content". McFann then elaborated, indicating that the
decisions to which he referred include answers to questions such as:

Will a normative or a fixed criterion approach be employed?
What is the form of the evaluation? Will specific task
performance be measured? Will transfer to other areas be
emphasized? Will they be presented in a problem approach?

In other words, the way(s) in which one chooses to assess instructional
outcomes will affect the validation of instructional procedures. In fact,
McFann stated, "Answers to such questions will influence training content".
McFann saw the validation of training content as a dynamic, interactive
process, whereby training content is initially determined and then, on the
basis of feedback about student performance on the job, instructional content
as well as instructional methods are modified to improve overall effectiveness.
Edmonston, Randall, and Oakland (1972) held that content validation is
central to CRT development. CRT items are sampled from a theoretically
large item domain, and must be representations of specified behavioral
objectives.

The  American  Psychological  Association's  Standards  for  Educational  & 
Psychological  Tests  (1974)  discussed  content  validity,  and  noted  that,  "An 
employer  cannot  justify  an  employment  test  on  grounds  of  content  validity 
if  he  cannot  demonstrate  that  the  content  universe  includes  all,  or  nearly 
all,  important  parts  of  the  job".  This  APA  document  also  discussed  construct 
validity  (which  it  found  most  applicable  in  research  studies),  and  criterion- 
related  validities.  These  latter  include  both  concurrent  and  predictive 
validities.  Criterion-related  validities  allow  inference  from  test  scores 
to standing on other, specified criteria. According to the APA standards,
"Predictive validity involves a time interval during which something may
happen. . . [while] concurrent validity reflects only the status quo at a
particular time."

Swezey and Pearlstein (1974) have stated that content validity "is
probably the single best way of assessing whether or not your CRT measures
what it is supposed to measure. . . [since it] is a matter of the extent to
which a test corresponds with its objectives". But they also noted that
content validity can only be said to exist when a test consists of
high-fidelity items, and that "Whether or not your test has content validity,
you should also compute statistical estimates of concurrent validity, predictive
validity, or both". Swezey and Pearlstein furnished simple techniques for
computing concurrent and predictive validities, both of which employ the
φ coefficient for correlating CRT results with appropriate, other measures
of the performance in question. A four-fold matrix (CRT, other measure,
pass, fail) is analyzed via φ, and values less than +.50 are regarded as
indicative of unacceptable concurrent or predictive validity. The technique
only varies in that it is applied concurrently in the one case, and
predictively (i.e., several months intervene between CRT and other measure) in
the other case. Swezey and Pearlstein cautioned that the other measure
must be suitable, and that the validation sample must be representative and
relatively large.
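
The four-fold analysis described above can be sketched in a few lines. The following is a minimal illustration only, not Swezey and Pearlstein's own worksheet: the function names and cell labels (a, b, c, d) are assumptions introduced here, and the +.50 threshold is the one cited in the text.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi coefficient for a four-fold (2 x 2) table.

    Cell labels are illustrative assumptions:
      a = passed CRT, passed other measure
      b = passed CRT, failed other measure
      c = failed CRT, passed other measure
      d = failed CRT, failed other measure
    """
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


def validity_acceptable(phi, minimum=0.50):
    """Apply the +.50 rule of thumb cited in the text."""
    return phi >= minimum


# Hypothetical validation sample of 100 examinees.
phi = phi_coefficient(40, 10, 10, 40)
print(round(phi, 2), validity_acceptable(phi))
```

The same computation serves for both concurrent and predictive validity; only the timing of the other measure differs.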

Hambleton and Novick (1971) proposed a validity theory in which a new
test serves as criterion. Hambleton and Novick apply a decision-theoretic
approach to both reliability and validity. Their suggested measure of
reliability is "the proportion of times that the same decision would be
made with the two parallel instruments". Hambleton and Novick indicated that
a decision-theoretic approach to validity takes the same form "except that
a new test (Y) would serve as criterion and the qualifying score on the
second test need not correspond with the qualifying score on the predictor
CRT. The criterion 'test' might well be derived from performance on the
next unit of instruction, or it could be a job-related performance criterion".
Lack of correspondence between qualifying scores (i.e., cut-off points) does
not necessarily make a predictor test invalid. This would be the case for
norm-referenced measurement (NRM), but for CRM what is predicted is whether
one will be above (or below) the qualifying score on a criterion test.
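
The proportion-of-agreement idea quoted above lends itself to a short sketch. The code below is an illustration under stated assumptions, not Hambleton and Novick's formal model: the function name is hypothetical, and "mastery" is assumed to mean a score at or above the qualifying score.

```python
def decision_agreement(scores_a, scores_b, cut_a, cut_b=None):
    """Proportion of examinees receiving the same pass/fail decision on two tests.

    With parallel forms and a shared cut score this approximates the
    reliability measure quoted above. With a separate criterion test and
    its own cut score (cut_b) it approximates the validity analogue.
    Mastery is assumed to be score >= cut.
    """
    if cut_b is None:
        cut_b = cut_a
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    same = sum((a >= cut_a) == (b >= cut_b) for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)


# Hypothetical data: two parallel 10-item forms, qualifying score of 8.
form_a = [8, 9, 5, 7, 10]
form_b = [9, 8, 6, 8, 4]
print(decision_agreement(form_a, form_b, cut_a=8))

# Hypothetical criterion measure scored 0-100 with its own qualifying score.
criterion = [70, 85, 40, 75, 90]
print(decision_agreement(form_a, criterion, cut_a=8, cut_b=70))
```

Note that the two qualifying scores need not correspond; only the agreement of the resulting decisions is counted.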

Although this approach appears reasonable, it seems that different
conclusions may be reached if test Y were a job-related criterion as opposed
to performance on the next unit of instruction. The different conclusions
could, however, yield approximations of convergent and divergent validity.
Validation of a test determined by correlating it with another test may,
however, give a distinct overestimate of "validity". This is particularly
true when the tasks on the two tests are similar.

Edmonston et al. (1972) advocated a method of CRT validation, which they
termed a criterion-oriented approach, that includes both concurrent and
predictive validity. In order to obtain complete information about an item
and the objective it assesses, the relationship of a CRT to other measures
should be considered (i.e., ratings by teachers or performance on suitable
NRT measures). Edmonston et al. viewed these as measures of concurrent validity.
In addressing problems of predictive validation, Edmonston et al. concurred
with Kennedy (1972), proposing that tests of curriculum mastery (which
represent higher order concepts taught within several curriculum units) be
used as criteria against which unit test items would be assessed as to their
predictive power. In addition, unit test items which are more temporally
proximate should agree more strongly with Mastery Test items than items
sequenced earlier. Final verification of this scheme of validity determination
requires factorially pure items, and this may be a bit too much to ask of
item writers.

Edmonston et al. endorsed an approach to construct validity initially
put forth by Nunnally (1967). Nunnally pointed out that constructs are
abstract variables, and that the more measures one obtains relating to a
construct, the more explicitly defined that construct becomes. The "internal
network" is, in Nunnally's terms, an internal structure based on the
"correlations among the measures of observables in a particular set." This
internal structure may show that measures are related to the same thing,
which is evidence that the set may be interpreted as a unitary construct.
Or, it may turn out that the structure indicates "that two or more things
are being measured by members of the set", in which case a unitary construct
is no longer sufficient. The example that Nunnally offered is that the
internal structure of a set of measures purportedly measuring anxiety is
such that it is clear that two types of anxiety are actually being measured.
Thus, it is appropriate to break the original set of anxiety measures into
two sets, anxiety type one and anxiety type two, corresponding to those
variables which actually intercorrelate highly.

Nunnally concluded that "If all the correlations among members of the
set are very low, it is illogical to continue speaking of the variables
as constituting a set. . ." In Nunnally's view, the measurement and
validation of a construct involve the determination of an internal network among
a set of measures, and the consequent formation of a network of probability
statements. This notion is similar to Cronbach and Meehl's (1955)
enunciation of the need for a "nomological network" with which to validate a construct.
Edmonston et al. indicated that the "specification of a hierarchy of learning
sets among items would seem to be the ultimate goal of construct validation
procedures, enabling the development of internal and cross structures between
items and the consequent understanding of the inter-relationships of all
curriculum areas". This concept would be difficult to implement, as the
construction of learning sets is not an easy procedure. Also, difficulty
can be expected in attempting to establish a network of relationships
sufficient to completely define a construct.

In Roudabush's (1973) view of validity, CRT items are designed to sample
the specified domain of behavior as purely as possible, and are then tried
out to determine their sensitivity to instruction. A 2 x 2 contingency table
containing post-test and pre-test outcomes is the basis for analysis:


                          Post-test
                       -           +
   Pre-test     -     f1          f2         f1 + f2
                +     f3          f4         f3 + f4
                   f1 + f3     f2 + f4

f1 = failed both pre- and post-test
f2 = failed pre-test, passed post-test
f3 = passed pre-test, failed post-test
f4 = passed both pre- and post-test

Marks and Noll (1967) assumed that f3 is due to guessing, and derived a
sensitivity index named (s), which is simply the proportion of cases missing
the item on the pre-test and passing it on the post-test, with a correction for
guessing.

Roudabush (1973), however, found that to derive a "reasonably reliable"
value for the index, there should be at least 50 cases who missed the item at
pre-test (f^), while if the f^ cell is high, the index will have little value
(neither will the item). A problem here may be ensuring that different but
parallel items are used for pre- and post-tests. This problem is a practical
one, but is particularly acute when complex content domains are involved.

Guion (1974) explored the relationships between job-relatedness and
several validities, and used employment tests (tests used to predict who is
suitable for hiring and subsequent training) as an example. He suggested
"that an employment test may provide a basis for inferences that have
criterion-related validity, or construct validity, or content validity, or all of these,
and still not be job related". Guion viewed job-relatedness "as the extent to
which the hypothesis of a relationship between the hiring requirement and job
behavior can be accepted as logical". Guion concluded that one new technique
that might help improve psychological measurement, bridging the job
relatedness-validities chasm, "is the content-referenced measurement of mastery".

These treatments of CRT validity all exhibit difficulties that might
prove insurmountable to a test constructor dealing with "real world" problems.
Content validity, however, is of primary importance in CRM and can be reasonably
ensured by careful attention to objective development. Construct validity
will probably prove elusive, if only due to the complexity of operations and
measures required for its demonstration. Predictive and concurrent validities
appear practicable in many situations.

Construction Methodology

NRTs are designed primarily to measure individual differences. The
meaning which can be attached to a score depends upon a comparison of that
score to a relevant norm distribution. A NRT is constructed to maximize
test score variability, since such a test is likely to produce fewer errors
in ordering individuals on the measured ability. Since NRTs are often used
for selection and classification purposes, minimizing errors of ordering
is extremely important.

NRTs are constructed using traditional item analysis procedures. It is
partly because of this that test scores cannot be interpreted relative to
some well-defined content domain. That is, items are selected to produce
tests with desired statistical properties (e.g., difficulty levels around .5),
rather than to be representative of a content domain.

CRTs, on the other hand, tend to have restricted ranges of variance.
Thus, they are not easily subjected to traditional item analysis procedures.
There are, however, ways of obtaining the necessary range of variance.
Haladyna (1974) performed a study to demonstrate the feasibility of combining
pre- and post-instruction CRT scores in order to increase score variance,
thereby permitting the use of classical psychometric methodology for item
analysis and test reliability. He administered units of instruction and CRTs
to 189 undergraduate education students, and computed test and item statistics
for three samples:

1. Preinstruction students, representing a nonmastery population,

2. Postinstruction students, representing a mastery population, and

3. The above two samples combined.

Haladyna found that combining the samples greatly increased variance over
either of the samples alone, and that point biserial discrimination indexes
computed for the combined samples appeared to be "the most efficient method
for obtaining information about the adequacy of the CR test items".
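
The combined-sample computation can be sketched as follows. This is an illustrative reading of the approach, not Haladyna's procedure as published: the function name and the small data set are hypothetical, and the standard point-biserial formula with the population standard deviation is assumed.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between one item (0/1) and total test score.

    Uses r = (M_pass - M_fail) / SD_total * sqrt(p * q), where SD_total is
    the population standard deviation of all total scores.
    """
    n = len(item_correct)
    if n == 0 or n != len(total_scores):
        raise ValueError("inputs must be non-empty and of equal length")
    p = sum(item_correct) / n
    sd = pstdev(total_scores)
    if p in (0, 1) or sd == 0:
        raise ValueError("no variance in item or total scores")
    mean_pass = mean(t for t, c in zip(total_scores, item_correct) if c)
    mean_fail = mean(t for t, c in zip(total_scores, item_correct) if not c)
    return (mean_pass - mean_fail) / sd * sqrt(p * (1 - p))


# Hypothetical pre- and post-instruction groups (item score, total score).
pre_item, pre_total = [0, 0, 0, 1], [2, 3, 2, 4]
post_item, post_total = [1, 1, 1, 0], [8, 9, 9, 7]

# Pooling the two groups widens the range of total scores.
combined_item = pre_item + post_item
combined_total = pre_total + post_total
print(round(point_biserial(combined_item, combined_total), 3))
```

Within either group alone the total scores span only a few points; pooling the groups supplies the variance that the classical index requires.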

Woodson (1974a and 1974b) has argued that item (and test) variance is a
necessary condition for item selection in CRT development, as well as in NRT
development. He noted that variance in NRM results from observations on
random samples of individuals in a population, while CRM variance results from
observations on a sample representative of the range of the characteristic
measured. This range of the characteristic can vary from no one passing
any items (as in pre-instruction testing) to everyone passing all items (as
in the ideal postinstructional outcome). Noting that "The better an item
discriminates among instances of the characteristic within the range of
interest, the more information the item gives us", Woodson concluded that
"If our measurement devices are sufficiently precise, individuals will be
ordered on an appropriate scale". Woodson's concept of CRT variance, and its
value in developing CRTs capable of ordering individuals, has yet to be
empirically verified, though.

Item homogeneity is also much sought in development of NRTs. The
ultimate purpose is to spread out individuals by maximizing the discriminating
power of each item. The emphasis is on comparing an individual's response
to the responses of others. The interest is not in absolute measurement of
individual skills, as in CRTs, but only in relative comparison. Thus, item
homogeneity is not directly applicable to CRM.

Nevertheless, item analysis is an important tool in test construction
and therefore has application to the construction of CRTs. Although content
validity is an important characteristic for a CRT item, other considerations
having to do with sensitivity and discriminating power of an item are also
important. These features are important in evaluating instruction and in
ensuring correct decisions about an individual's progress through instruction.

In  CRT  development,  an  item  difficulty  index  may  be  useful  for  selecting 
items.  However,  item  difficulty  is  used  differently  than  in  NRM.  If  the 
content domain is carefully specified, test items written to measure
accomplishment of an objective should also be carefully specified and closely
associated  with  the  objective.  Thus,  all  items  associated  with  the  same 
objective  should  be  answered  correctly  by  approximately  the  same  proportion 
of  examinees  in  a  group.  Items  which  differ  greatly  should  be  examined 
carefully  to  determine  if  they  coincide  with  the  intent  of  the  objectives. 
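
One way to screen items along these lines is sketched below. The text does not say how large a difference counts as "greatly"; the tolerance of .20 around the median difficulty, like the function name, is an assumption made purely for illustration.

```python
from statistics import median


def flag_divergent_items(responses_by_item, tolerance=0.20):
    """Flag items whose difficulty departs from the others on the same objective.

    responses_by_item maps an item label to a list of 0/1 scores from the
    same group of examinees. Difficulty is the proportion answering
    correctly. The tolerance is an illustrative assumption.
    """
    difficulty = {
        item: sum(scores) / len(scores)
        for item, scores in responses_by_item.items()
    }
    center = median(difficulty.values())
    flagged = sorted(
        item for item, p in difficulty.items() if abs(p - center) > tolerance
    )
    return difficulty, flagged


# Hypothetical items written for a single objective.
items = {
    "A": [1, 1, 1, 0, 1],
    "B": [1, 1, 0, 1, 1],
    "C": [0, 0, 1, 0, 0],
}
difficulty, flagged = flag_divergent_items(items)
print(difficulty, flagged)
```

A flagged item is a candidate for review against the intent of the objective, not an automatic rejection.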

Similarly, item discrimination indexes can be useful in CRT development.
Negative discrimination indexes warn that CRT items need modification, or
that the instructional process is faulty. A negative index would be
indicative of a high proportion of "false negatives".

Klein  and  Kosecoff  (1973)  discussed  item  analysis  as  a  means  for  improving 
CRT  item  quality,  and  noted  that  selection  of  "good"  items  varies  as  a  function 
of  item  analysis  method.  They  then  described  two  concepts  underlying  four 
general types of item analysis methods used in the development of CRTs:

1. Sensitivity to instruction: "good" items are failed prior to the
relevant instruction, and passed following instruction, and

2. Discrimination among items on the basis of internal consistency:
"good" items discriminate between those who do well on the test as
a whole (or on some external criterion) and those who don't.

Klein and Kosecoff delineated the four general item analytic approaches
based on these two concepts: comparison group (masters versus non-masters,
or have-received-instruction versus have-not-received-instruction); single
group using pre- and post-instruction tests; single group using posttest
only; and single group with repeated measures (test is administered until
mastery is achieved on all items, and pass-fail patterns are examined to
detect reversals). Klein and Kosecoff's analysis indicated that the latter
two methods are less applicable than the first two. They also cautioned
that, when the first type of method is used, both groups must be equated as
to general intellectual ability, or as to other factors that might contaminate
the comparison.

One  attempt  to  use  item  analysis  techniques  to  develop  test  evaluation 
indexes  was  undertaken  by  Ivens  (1970).  Ivens  has  defined  reliability  indexes 
based on the concept of within-subject score equivalence. Item reliability
is  defined  as  the  proportion  of  subjects  whose  item  scores  are  the  same  on 
the  post-test,  as  on  either  a  retest  or  a  parallel  form.  Score  reliability 
is  then  defined  as  the  average  item  reliability. 
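
The two definitions above translate directly into a short computation. The sketch below follows the wording of this summary rather than Ivens' original notation; the function names and data layout (one list of subject scores per item) are assumptions.

```python
def item_reliability(post_scores, second_scores):
    """Proportion of subjects whose score on one item is the same on the
    post-test and on a retest (or parallel form)."""
    if not post_scores or len(post_scores) != len(second_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(post_scores, second_scores))
    return same / len(post_scores)


def score_reliability(post_by_item, second_by_item):
    """Average of the item reliabilities across all items."""
    per_item = [
        item_reliability(post, second)
        for post, second in zip(post_by_item, second_by_item)
    ]
    return sum(per_item) / len(per_item)


# Hypothetical 0/1 scores for four subjects on two items.
post_test = [[1, 1, 0, 1], [1, 1, 1, 1]]
retest = [[1, 0, 0, 1], [1, 1, 1, 1]]
print(score_reliability(post_test, retest))
```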

Rahmlow, Matthews, and Jung (1970) suggested that the function of a
discrimination index in a CRT is primarily that of indicating item homogeneity
with respect to the specific instructional objective measured. These authors
focused on shifts in item difficulty from pre-instruction to post-instruction.

Helmstadter (1972) compared the following alternative indexes of item
usefulness:

1. Item discrimination based on high and low groups on a post-instructional
measure.

2.  Shift  in  item  difficulty  from  pre-  to  post-instruction. 

3.  Item  discrimination  based  on  pre-  and  post-test  performance. 

Shift  in  item  difficulty  from  pre-  to  post-instruction  produced  results 
significantly  more  similar  to  the  pre-post  discrimination  index,  than  did 
the  high-low  group,  post-test  discrimination  index  comparison. 
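
To make the contrast between the indexes concrete, the sketch below computes a pre-to-post difficulty shift and a high-low post-test discrimination index for one item. It is not Helmstadter's procedure: the function names are hypothetical, and splitting the post-test group into upper and lower halves (rather than, say, extreme thirds) is an assumption.

```python
def difficulty_shift(pre_item, post_item):
    """Change in proportion passing the item from pre- to post-instruction."""
    return sum(post_item) / len(post_item) - sum(pre_item) / len(pre_item)


def high_low_discrimination(post_item, post_totals):
    """Proportion passing in the upper half minus the lower half,
    with halves formed on post-test total score (an assumed split)."""
    if len(post_item) < 2 or len(post_item) != len(post_totals):
        raise ValueError("need at least two examinees and equal-length inputs")
    order = sorted(range(len(post_totals)), key=lambda i: post_totals[i])
    half = len(order) // 2
    low, high = order[:half], order[-half:]
    p_high = sum(post_item[i] for i in high) / half
    p_low = sum(post_item[i] for i in low) / half
    return p_high - p_low


# Hypothetical item: mostly failed before instruction, passed by all after.
pre_item = [0, 0, 0, 1]
post_item = [1, 1, 1, 1]
post_totals = [7, 8, 9, 10]
print(difficulty_shift(pre_item, post_item))
print(high_low_discrimination(post_item, post_totals))
```

In this constructed case the item shows a large difficulty shift but no post-test discrimination, which illustrates how the two kinds of index can rank the same item very differently.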

Helmstadter also compared an item discrimination index applied to pre-
and post-instruction with difficulty indexes derived in the same fashion.
His findings resulted in the conclusion that caution should be observed when
using  traditional  item  analysis  procedures  with  CRTs.  In  a  similar  finding, 
Roudabush  (1973)  described  a  situation  where  use  of  traditional  item  statistics 
would  have  resulted  in  some  objectives  being  over-represented  while  others 
would  not  be  represented  at  all. 

Ozenne  (1971)  has  developed  an  elaborate  model  of  subject  response  which 
he  used  to  derive  an  index  of  sensitivity.  In  this  formulation,  the  sensitivity 
of a group of comparable measures, given to a sample of subjects before and
after instruction, is defined as the variance due to the instructional effect
divided by the sum of the variance due to the instructional effect and error
variance. This index was, however, developed for a severely restricted sample
in order to allow an analysis of variance treatment. Further development is
indicated before the technique has general usefulness for sensitivity
measurement, or item selection.

New procedures have been developed for item analysis of specific CRTs,
but evidence as to their generalizability is lacking. If item analytic
procedures are to be used in evaluating CRTs, one must consider what sort
of score is produced by the item. The most typical scoring involves a
pass-fail dichotomy. A CRT item can result in two types of incorrect decisions.
Roudabush and Green (1972) referred to these errors as "false positives" and
"false negatives". In this view, reliability is concerned with the CRT's
ability to consistently make the same decision. Consequently, validity
becomes the ability of a CRT to make the "right" decision (i.e., avoiding
false negatives and false positives). In these authors' view, the adequacy
of a CRT is determined by its ability to discriminate consistently and
appropriately over large numbers of items.

Swezey and Pearlstein (1974) suggested comparing "masters" and
"non-masters" as to pass-fail on items, thereby circumventing the internal
consistency problem. "Masters" and "non-masters" can be defined either in terms
of completion/noncompletion of the relevant instruction, or in terms of skill
level on some external criterion; i.e., the "master" has had considerable
experience in the subject area, while the non-master has not. A φ
coefficient is computed for each item ("master-nonmaster", pass-fail), and a value
of less than +.30 indicates an item of questionable utility.
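
A per-item screen of this kind might be sketched as follows. The function names and data layout are assumptions made for illustration; only the master/non-master by pass/fail table and the +.30 cut-off come from the text.

```python
from math import sqrt


def item_phi(is_master, passed):
    """Phi coefficient for one item from paired 0/1 lists
    (master vs. non-master, pass vs. fail). Returns None if undefined."""
    a = sum(1 for m, p in zip(is_master, passed) if m and p)
    b = sum(1 for m, p in zip(is_master, passed) if m and not p)
    c = sum(1 for m, p in zip(is_master, passed) if not m and p)
    d = sum(1 for m, p in zip(is_master, passed) if not m and not p)
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return None
    return (a * d - b * c) / denominator


def questionable_items(is_master, responses_by_item, minimum=0.30):
    """Return labels of items whose phi is undefined or below the minimum."""
    flagged = []
    for label, passed in responses_by_item.items():
        phi = item_phi(is_master, passed)
        if phi is None or phi < minimum:
            flagged.append(label)
    return flagged


# Hypothetical group of four masters followed by four non-masters.
is_master = [1, 1, 1, 1, 0, 0, 0, 0]
responses = {
    "item1": [1, 1, 1, 0, 0, 0, 0, 1],
    "item2": [1, 1, 0, 0, 1, 1, 0, 0],
}
print(questionable_items(is_master, responses))
```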

Carver (1970) proposed two procedures to assess reliability of CRT
items. For a single form, he suggested comparing the percentage meeting
criterion level in one group to the same percentage in another "similar"
group. For homogeneous sets he recommended using one group and comparing
the percentages who meet the criterion on all items.

Meredith and Sabers (1972) pointed out that the way in which two CRT
items, whether identical or parallel, identify the same individual with
regard to his attainment of criterion level must be determined. With regard
to item analysis procedures, if a CRT item is administered before and after
instruction, and does not discriminate, there are alternatives to labeling
it unreliable. A non-discriminating item may simply be an invalid measure
of the objective or it may indicate that the instruction itself is inadequate
or unnecessary. Meredith and Sabers suggested the use of a matrix consisting
of the pass-fail decisions of two CRTs. By defining two CRT items as being
the same measures, it is possible to examine test/re-test reliability.
Without time intervening between the measures, however, reliability is of the
concurrent variety. In addition, problems exist with the acceptability of
defining two CRTs as the same. Considerable confusion is evidenced in the
use of "same" and parallel forms without formal definitions. Similarly, it
was stated that if one CRT item is a "criterion measure", then the validity
of the other CRT can be determined. By definition, both are criterion
measures, and if one is external to the instructional domain, then it is not a
CRT item in the same sense. Various coefficients were presented, but
difficulties in definition limit their usefulness.

Fidelity 

Fredericksen  (1962)  has  proposed  a  hierarchical  model  for  describing 
levels  of  fidelity  in  performance  evaluation.  The  model  uses  six  categories: 

1. Solicit opinions. This category, the lowest level, may often miss a
crucial question (e.g., to what extent has the behavior of trainees
been modified as a function of the instructional process?).

2. Administer attitude scales. This technique, although psychometrically
refined via the work of Thurstone, Likert, Guttman, and others,
assesses primarily a psychological concept (attitude) which can
only be presumed to be concomitant with performance.

3. Measure knowledge. This is the most commonly used method of assessing
achievement. This technique is usually considered adequate only if
the training objective is to teach knowledge, or if highly defined,
fixed procedure tasks are involved.

4.  Elicit  related  behavior.  This  approach  is  often  used  in  situations 
where  practicality  dictates  observation  of  behavior  thought  to  be 
logically  related  to  the  criterion  behavior. 

5.  Elicit  "What  Would  I  Do"  behavior.  This  method  involves  presentation 
of brief descriptions or scenarios of problem situations under
simulated  predesigned  conditions;  the  subject  is  required  to  indicate 
how  he  would  solve  the  problem  if  he  were  in  the  situation. 

6.  Elicit  lifelike  behavior.  Assessment  under  conditions  which  approach 
the  realism  of  the  real  situation. 

Measurement at any of these six levels possesses both advantages and
disadvantages. An optimal solution would be to assess individual performance
at the highest possible level of fidelity. Unfortunately, deriving observed
performance data may involve a subjective (rating) scale, thereby requiring
a subjectivity vs. fidelity tradeoff. In order to minimize subjectivity,
it may be necessary to decrease the level of fidelity so that more objective
measurements (such as time and errors) can be obtained. These measures can
be conceptualized as surrogates that in some sense embody real criteria,
but have the virtue of measurability (Rapp et al., 1970). An actual increase
in overall criterion adequacy may result from a gain in objectivity which
compensates for a corresponding loss in fidelity.

The  question  of  fidelity  addresses  the  issue  of  how  much  the  test  should 
resemble  actual  performance.  Fidelity  is  not  usually  at  issue  in  NRM  and 
has  its  primary  application  in  criterion-referenced  performance  tests. 

There  are  often  trades  to  be  made  between  fidelity  and  cost.  But  a  more 
salient  issue  is  how  to  modify  fidelity  to  satisfy  needs  of  the  testing 
situation,  while  retaining  the  essential  stimuli  and  demand  characteristics 
of  the  actual  performance  situation.  Swezey  and  Pearlstein  (1974)  suggested 
that  when  creating  items,  the  CRT  developer  "Select  the  format  that  best 
approximates  the  behavior  specified  by  the  objective",  and  that  "will  permit 
the  highest  level  of  fidelity  practicable". 

Osborn (1970) addressed problems of finding efficient alternatives to
work sample tests. Osborn was concerned with developing a methodology to
allow derivation of cheaper procedures while preserving content validity.
There are many situations where job sample tests are not feasible, and
job-knowledge tests are not relevant. The existence of intermediate or "synthetic"
measures would be a great boon to evaluating performance in these situations;
however, specific methods for developing such measures are lacking.

Osborn  gave  a  brief  outline  of  a  method  for  developing  synthetic  measures. 
He  presented  a  two-way  matrix  defined  by  methods  of  testing  terminal  performance 
(simple  to  complex)  and  component  (enabling)  behaviors.  This  matrix  serves  as 
a  decision-making  aid  by  allowing  the  test  constructor  to  choose  the  most 
cost-effective test method for each behavior. Tradeoffs must be made by
the test constructor among test relevance, obtaining diagnostic performance
data,  ease  of  administration,  and  cost.  Osborn's  notions  are  intriguing  but 
more  developmental  work  is  needed  before  a  workable  method  for  deriving 
synthetic  performance  tests  is  available. 

Vineberg and Taylor (1972) addressed a topic close to the fidelity issue:
the extent to which job knowledge tests can be substituted for
performance tests. Practical considerations have often dictated the use of
paper-and-pencil job knowledge tests because they are simple and economical to
administer and easy to score. However, the use of paper-and-pencil tests
to provide indexes of performance is considered to be poor practice.

HumRRO research under Work Unit UTILITY compared the proficiency of
Army men at different ability levels and with different amounts of job
experience. This work provided Vineberg and Taylor the opportunity to
examine relationships among job sample test scores and job knowledge test
scores in four U.S. Army jobs that varied greatly in type and complexity.
Vineberg and Taylor found that job knowledge tests are valid for measuring
proficiency in jobs where (1) skill components are minimal, and (2) job
knowledge tests are carefully constructed to measure only information directly
relevant to performing the job at hand. Given the high costs of obtaining
performance data, these findings suggest that job knowledge tests are
appropriate where careful job analysis has determined that skill requirements
are minimal.

In a similar study, Engel and Rehder (1970) compared peer ratings, a
job knowledge test, and a work-sample test. While the knowledge test was
acceptably reliable, it lacked validity, and reading ability tended to enter
into the score. Peer ratings were judged to have unacceptable validity and
were essentially uncorrelated with the written test. The troubleshooting
items on the written test exhibited a moderate level of validity, while
the corrective-action items had little validity. Finally, Engel and Rehder
noted that the work-sample test is the most costly and difficult to administer,
while peer ratings and written tests were less costly and easier to administer.

Process  vs.  Product  Measurement 

Osborn  (1973a  and  1973b)  discussed  an  important  topic  directly  related 
to  CRT  validity  and  fidelity.  Osborn  pointed  out  that  task  outcomes  and 
products  are  often  used  to  assess  student  performance  while  measures  of 
how  the  tasks  are  done  (processes)  generally  pertain  to  the  diagnosis  of 
instructional  systems.  Time  or  cost  factors  sometimes  preclude  the  use  of 
product measures, thus leaving process measures as the only available criteria.
There are cases where this focus on process is legitimate and useful, but
many where it is not. Osborn developed three classes of tasks to illustrate
what  the  relative  roles  of  product  and  process  measurement  should  be: 

"1.  Tasks  where  the  product  is  the  process. 

2.  Tasks  in  which  the  product  always  follows  from  the  process. 

3. Tasks in which the product may follow from the process."

Relatively few tasks can be classified as the first type. Osborn
offered gymnastic exercises and springboard diving as examples. More tasks
belong to the second classification, fixed procedure tasks. In these tasks,
if the process is performed correctly, the product follows. The largest
single class of tasks is of the third type. For tasks of this last type,
the process may appear to have been correctly carried out for cases in which
the product was not attained. Osborn offered two reasons for this: either
(1) "we are unable to fully specify the necessary and sufficient steps in
task performance", or (2) "we do not or cannot accurately measure them". An
example of aim-firing a rifle was given as an illustration that there is no
guarantee of acceptable marksmanship even if all procedures are followed. In
this case, process measurement is not an adequate substitute for product
measurement.

For  tasks  of  the  first  two  types,  Osborn  concludes  that  it  really  doesn't 
matter  which  measure  is  used  to  assess  proficiency.  But  for  tasks  of  the 
third type, product measurement is indicated. There are, however, a number
of type 3 tasks where product measurement is impractical because of cost,
danger, or other constraints. In these cases, process measures are substituted
with a resulting decrement to the validity of the measure. Osborn poses a
salient question for the test developer: "If I use only a process measure
to test a man's achievement on a task, how certain can I be from this process
score that he would also be able to achieve the product or outcome of the
task?" Osborn holds that "Where the degree of certainty is substantially
less than that expected by errors of measurement, the test developer should
pause and reconsider ways in which time and resource limitations could be
compromised in achieving an approximation to product measurement". Osborn
concluded  by  noting:  "The  accomplishment  of  product  measurement  is  not  always 
a  simple  matter;  but  it  is  a  demanding,  and  essential  goal  to  be  pursued  by 
the  performance  test  developer  if  his  products  are  to  be  relevant  to  real 
world  behavior." 
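
Osborn's three task classes and his conclusions about them can be read as a simple decision rule. The sketch below is illustrative only: the names TaskClass and recommend_measure and the returned labels are assumptions made for this example and do not appear in Osborn's papers.

```python
from enum import Enum


class TaskClass(Enum):
    """Osborn's three classes of tasks (member names are illustrative)."""
    PRODUCT_IS_PROCESS = 1       # e.g., springboard diving
    PRODUCT_ALWAYS_FOLLOWS = 2   # fixed-procedure tasks
    PRODUCT_MAY_FOLLOW = 3       # e.g., aim-firing a rifle


def recommend_measure(task_class, product_measurable=True):
    """Return a rough recommendation for process vs. product measurement."""
    if task_class in (TaskClass.PRODUCT_IS_PROCESS,
                      TaskClass.PRODUCT_ALWAYS_FOLLOWS):
        # For the first two types it does not matter which measure is used.
        return "either"
    if product_measurable:
        # Type 3: process measurement is not an adequate substitute.
        return "product"
    # Type 3 where cost, danger, or other constraints rule out the product.
    return "process (reduced validity)"
```

Under these assumptions, a type 3 task whose product cannot be measured still receives a process measure, but the label carries the loss of validity that Osborn warns about.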

Swezey and Pearlstein (1974) have also addressed process versus product
measurement, and assist versus non-interference methods of scoring in CRT
development. They recommended process measurement in addition to, or instead
of, product measurement when: diagnostic information is desired; critical
points in the process, if misperformed, may cause injury or damage; additional
scores are needed on a particular task; the product always follows from the
process; and there is no product at the end of the process.

Another issue important to the construction of complex CRTs is bandwidth
fidelity (Cronbach and Gleser, 1965), i.e., the question of whether
to obtain precise information about a small number of competencies or less
precise information about a larger number. Hambleton and Novick (1971)
conclude that the problem of fixing the lengths of subscales to maximize
the percentage of correct decisions on the basis of test results has yet to be
resolved or even satisfactorily defined.

Issues  Related  to  CRT  Construction 

Although construction methodology for NRTs is well-established and
highly specified, construction techniques for CRTs have been less well
specified. There have been, however, several attempts to formalize the CRT
construction process. Ebel (1962) describes the development of a criterion-
referenced test concerning knowledge of word meanings. Three steps were
involved:

1.  Specification  of  the  universe  to  which  generalization  is  desired. 

2. A systematic plan for sampling from the universe.

3. A standardized method of item development.

These characteristics together serve to define the meaning of test scores.

Flanagan (1962) indicates that a variant of Ebel's procedure was used
in Project TALENT. Tests used in the areas of spelling, vocabulary, and
reading were not based on specific objectives, but developed by systematically
sampling a relevant domain. Fremer and Anastasio (1969) also put forth a
method for systematically generating spelling items from a specified domain.

Osburn (1968) notes two prerequisites for inferences drawn about a
domain of knowledge from performance on a collection of items:

1. All items that could possibly appear on a test should be specified
in advance.

2.  The  items  in  a  particular  test  should  be  selected  at  random  from 
the  content  universe. 

It  is  rarely  feasible  to  satisfy  the  first  prerequisite  for  complex 
behavior  domains.  However,  the  problem  of  testing  all  items  can  be  overcome, 
at least in highly-specified content areas, by the use of an item form
(Hively,  1968,  1973;  Osburn,  1968).  The  item  form  generally  has  the  following 
characteristics  (Osburn,  1968): 

1. It generates items with a fixed syntactical structure.

2. It contains one or more variable elements.

3. It defines a class of item sentences by specifying the replacement
sets for the variable elements.

But Klein and Kosecoff (1973) have noted that even very specific objectives,
e.g., "compute the correct product of two single digit numerals greater than
0, the product not exceeding 20", may yield possible item pools "of well
over several thousand items". In the example objective above, there are 29
pairs of possible numerals times at least 10 different suitable item types
(e.g., 6 x 2, vs. the same problem written vertically, vs. (6)(2), vs.
6 x _ = 12, etc.) times variations in numeral
sequence (e.g., 6 x 2, 2 x 6) times variations in item format (e.g., multiple
choice, fill-in-the-blank, etc.) times variations in presentation mode (e.g.,
oral or written) times variations in mode of response (oral or written),
equaling 4,640 potential items, assuming only two types for each of the
variations indicated. So, item forms require careful consideration, even for
highly defined areas such as mathematics.
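
The arithmetic behind the 4,640 figure can be laid out explicitly. The short sketch below simply multiplies the counts as reported in the paragraph above (29 numeral pairs, 10 item types, and two alternatives for each of the four remaining variations); the dictionary keys are labels chosen for this illustration.

```python
from math import prod

# Counts as reported above for the Klein and Kosecoff (1973) example.
factors = {
    "numeral pairs": 29,
    "item types": 10,
    "numeral sequence": 2,
    "item format": 2,
    "presentation mode": 2,
    "response mode": 2,
}

pool_size = prod(factors.values())
print(pool_size)  # 4640
```

Because the factors multiply, allowing a third alternative for any one variation would raise the pool to 6,960 items, which shows how quickly an item pool can grow.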

Shoemaker and Osburn (1969) describe a computer program capable of generating
both random and stratified random parallel tests from a well-defined
and rule-bound population. Results have led to the conclusion that difficulties
in defining test construction processes are directly related to the complexity
of the behavior the test is designed to assess (Jackson, 1970). Where the
domain is easily specified (as in spelling) the construction process is
simplified. Jackson (1970) concludes, "For complex behavior domains, it
appears that at least until explicit models stated in measurable terms are
developed, a degree of subjectivity in test construction (and attendant
population-referenced scaling) will be required." The best approach appears to
be the use of a detailed test specification which relates test item development
processes to behavior.
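
As an illustration of what generating stratified random parallel tests from a rule-bound population can involve, the following minimal sketch draws the same number of items from each stratum for every form. It is not a reconstruction of the Shoemaker and Osburn program; the function name parallel_forms, the two strata, and the arithmetic item pool are all assumptions made for the example.

```python
import random


def parallel_forms(item_pool, n_forms, per_stratum, seed=None):
    """Draw n_forms tests, each with per_stratum items from every stratum.

    item_pool maps a stratum name to a list of items. Items are drawn
    without replacement within a form; forms are drawn independently,
    so two forms may share some items.
    """
    rng = random.Random(seed)
    forms = []
    for _ in range(n_forms):
        form = []
        for stratum in sorted(item_pool):
            form.extend(rng.sample(item_pool[stratum], per_stratum))
        forms.append(form)
    return forms


# Hypothetical rule-bound pool: single-digit addition and multiplication.
pool = {
    "addition": [f"{a} + {b}" for a in range(1, 10) for b in range(1, 10)],
    "multiplication": [f"{a} x {b}" for a in range(1, 10) for b in range(1, 10)],
}
forms = parallel_forms(pool, n_forms=2, per_stratum=5, seed=0)
```

Setting per_stratum separately for each stratum would be a natural extension where objectives differ in importance; a simple random (unstratified) form corresponds to a pool with a single stratum.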

Edgerton (1974) has suggested that the relationships among instructional
methods, course content, and item format have not been adequately explored.
Item format should require thinking and/or performing in the patterns sought
by the instructional methods. If the instruction is aimed at problem solving,
then the items should address problem solving tasks and not, for example,
knowledge about the required background content. Edgerton suggests that if
one mixes styles of items in the same test, one runs the risk of measuring
"test taking skill" instead of subject matter competence. In a practical
application, Osborn (1973) suggests a fourteen-step procedure in the course
of developing tests for training evaluation. Swezey and Pearlstein (1974)
have suggested that the following factors must be considered when designing
a test plan to guide item development:

1. Overcoming practical constraints in test administration (time,
manpower, and facilities availability, etc.) by selecting among
objectives (randomly testing non-critical objectives) or modifying
objectives.

2.  Planning  item  format  and  level  of  fidelity. 

3.  Sampling  items  within  objectives. 

4.  Sampling  among  multiple  conditions. 

5.  Deciding  how  many  items  to  include  on  the  test. 

Mastery  Learning 

Besel  (1973  a,  b)  contends  that  norm-group  performance  data  are  useful 
for  the  construction  of  CRTs.  Besel  defines  a  CRT  as  a  set  of  items  sampled 
from a domain which is judged to be an adequate representation of an instructional
objective. The domain should be described in sufficient detail to allow
independent test developers to generate equivalent items measuring the same
content in an equally reliable fashion.

Besel recommends a "Mastery Learning Test Model" to provide an appropriate
algorithm for support of mastery/non-mastery decisions. The Model, and its
underlying true score theory, is related to a notion developed by Emrick (1971).
Emrick assumed that measurement error was attributable to two sources: α,
the probability that a non-master will correctly answer an item ("false
positive") and β, the probability that a master will give an incorrect answer
("false negative"). Emrick's model assumes that all item difficulties and
inter-item correlations are equal, a somewhat difficult assumption. Besel
(1973 a, b) developed algorithms for estimating α and β. Three data sources
are required:

1.  Item  difficulties 

2. Inter-item covariance

3. Score histograms

Besel reports that "the usage of an independent estimate of the proportion
of students reaching mastery resulted in improved stability of Mastery
Learning parameters" in a tryout sample. Improved stability of α and β
should promote increased confidence in mastery/non-mastery decisions. Besel's
computational procedures are, however, quite involved, using a multiple regression
approach which requires independent a priori estimates of variance due to
conditions. Besel also points out that β is estimated best for a group when
the mastery level is lowered, while the reverse is true for α. In other
words, Besel has empirically established a relationship between errors of
misclassification and criterion level. A decision, however, has not been made
concerning the relative cost/effectiveness of competing errors of misclassification.
Such decisions may be specific to instructional situations.
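
To show how the two error probabilities bear on a mastery/non-mastery decision, the sketch below computes the probability that an examinee is a master given a number-correct score. It is not Besel's estimation procedure. It assumes that items are answered independently with the same error rates (in the spirit of the equal-difficulty assumption noted above), that both error rates are already known, and that an independent estimate of the proportion of masters is available; the function names are invented for this example.

```python
from math import comb


def score_likelihood(score, n_items, p_correct):
    """Binomial probability of `score` correct answers out of `n_items`."""
    return (comb(n_items, score)
            * p_correct ** score
            * (1 - p_correct) ** (n_items - score))


def prob_master(score, n_items, alpha, beta, prior_master):
    """Posterior probability of mastery given a number-correct score.

    alpha: probability that a non-master answers an item correctly.
    beta: probability that a master answers an item incorrectly.
    prior_master: assumed proportion of masters in the group.
    """
    like_master = score_likelihood(score, n_items, 1 - beta)
    like_non_master = score_likelihood(score, n_items, alpha)
    numerator = prior_master * like_master
    return numerator / (numerator + (1 - prior_master) * like_non_master)
```

Moving the cut-off score up or down trades one kind of misclassification for the other, which is why the relative cost of the two errors has to be settled for each instructional situation.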

Establishing and Classifying Instructional Objectives

The  development  of  student  performance  objectives  for  instructional 
programs  has  become  a  widespread  process  within  the  educational  community. 
Information  is  generally  derived  from  instructional  objectives,  which  provide 
not  only  specifications  for  instruction,  but  also  the  basis  for  instructional 
evaluation (Lyons, 1972). Ammerman and Melching (1966) trace the interest in
behaviorally-stated objectives from three independent movements within
education. The first derives from the work of Tyler (1934, 1950, 1964) and
his  associates,  who  worked  for  over  35  years  at  specifying  goals  of  education 
in terms of meaningful and useful information for the classroom teacher.
Tyler's work has had considerable impact in the trend toward describing
objectives in terms of instructional outcomes.

The second development came from the need to specify man-machine interactions
in modern defense equipment configurations. Miller (1962) was responsible
for pioneering efforts in describing and analyzing job tasks. Chenzoff
(1964) reviewed the then existing methods in detail, and many more have
appeared since that date. More recently, Davies (1973) classified task
analysis schemes into six categories:

1.  Task  analysis  based  upon  objectives,  which  involves  analysis  of  a 
task  in  terms  of  the  behaviors  required. 

2.  Task  analysis  based  upon  behavioral  analysis  of  concepts,  chains,  etc. 

3. Task analysis based upon information processing needs for performance.

4. Task analysis based upon a decision paradigm which emphasizes the
judgment and decision-making rationale of the task.

5. Task analysis based upon the subject matter structure of a task.

6. Task analysis based upon vocational schematics which involve
analysis of jobs, duties, tasks and task elements.

The point of Davies' breakdown is that there is no single task analysis procedure
which is always applicable. The typical approach is to create a new
task  analysis  scheme  or  modify  an  existing  scheme  to  suit  the  needs  of  the 
situation  at  hand. 

The third development was the concept of programmed instruction, which
required writers of programs to acquire specific information on instructional
objectives.

It is apparent that the use of instructional objectives has now become
an accepted educational practice. A critical event in this process was the
publication of Mager's (1962) little book Preparing Instructional Objectives.
In this work, Mager set forth requirements for the form of a useful objective,
but did not deal with procedures by which one could obtain information to
support preparation of the objectives. A series of additional works, including
one on measuring instructional intent (Mager, 1972), have dealt more thoroughly
with such issues.

Actual  behaviors  exhibited  by  acceptable  performers  are  generally  preferred 
as  the  bases  for  constructing  instructional  objectives.  However,  data  can 
come  from  a  variety  of  sources,  including: 

1.  Supervisor  interview 

2. Job incumbent interview

3. Direct observation of performance

4. Inferences based upon system operation

5. Analysis of "real world" use of instruction

6. Instructor interview

Many  sophisticated  methods  are  used  to  derive  these  data.  Flanagan's 
(1949)  "critical  incident  technique",  and  the  modifications  it  has  inspired, 
are good examples of efforts aimed at identifying essential performances, while
eliminating information not directly related to the successful accomplishment
of a job-related task.

The choice of a method for deriving job training content must be based
upon the type of performance, and upon other realistic factors such as
accessibility of the performance to direct observation. Generally the
solution is less than ideal, but techniques such as Ammerman and Melching's
(1966) can be used to review objectives and to provide a useful critique of
the data collection method. An exhaustive review of the techniques for
deriving instructional objectives is inappropriate here. The reader is
directed to Lindvall (1964) and to Smith (1964) for comprehensive treatments
of this question.

Klein and Kosecoff (1973) have summarized four general procedures used
in developing objectives for CRTs. The first, "expert judgment", is the most
common approach in their opinion, and involves a small group of subject matter
experts meeting to arrive at a consensus of important objectives to measure.
Objectives thereby identified are screened on the basis of practical constraints,
and are modified as necessary. The second procedure, "consensus judgment",
is similar to the first, but uses "various groups such as community representatives,
curriculum experts", etc. to decide which objectives should be
measured. Appropriate measurement or curriculum personnel then translate the
objectives into terms permitting assessment.

"Curriculum analysis", the third approach, involves analysis of curriculum
materials (e.g., textbooks), and subsequent identification or inference of
objectives therein by a team of curriculum experts. Finally, the fourth
approach, "analysis of the area to be tested", is similar to the task analytic
approaches previously discussed. Contents and behaviors in the subject area
are identified, and organized hierarchically (or according to some other
sequence) to derive objectives.

Ammerman and Melching (1966) have developed a system for the analysis
and classification of terminal performance objectives. They examined a great
number of objectives from diverse sources, and concluded that five factors
accounted for the significant ways in which most existing performance objectives
differed. These factors are:

1.  Type  of  performance  unit 

2.  Extent  of  action  description 

3.  Relevancy  of  student  action 

4.  Completeness  of  structural  components 

5.  Precision  of  each  structural  component 

Ammerman and Melching have identified a number of levels under each of these
factors. For instance, factor #1 has three levels, from specific task, which
involves one well-defined particular activity in a specific work situation,
to generalized behavior, which refers to a general measure of performance,
or way of behaving, such as the "work ethic".

With these five factors, and their sub-levels, it is possible to classify
any terminal objective via a five digit number. This scheme is valuable for
management control and for review of terminal performance objectives. Ammerman
and Melching feel the method can fulfill three main purposes:

1.  Provision  of  guidance  for  the  derivation  of  objectives  and  for 
standardization  of  statements  of  objectives,  so  that  all  may  meet 
the  criteria  of  explicitness,  relevance,  and  clarity. 

2.  Evaluating  the  proportion  of  objectives  dealing  with  specific  or 
generalized  action  situations. 

3. Evaluating the worth of a particular method for deriving objectives.

This is a useful method, particularly where a panel of judges is available
to review each objective. A coefficient of congruence can be computed among
judges' placement of objectives on the five dimensions. Used in this fashion,
the Ammerman and Melching method should prove useful in the development of
instructional systems.
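
The five-digit classification and a rough check on agreement among judges can be sketched as follows. The level values used here are hypothetical (the source states only that factor #1 has three levels), each level is assumed to fit in a single digit, and the agreement index shown, the share of judges choosing the most common level on each factor, is a simple stand-in and not necessarily the coefficient of congruence referred to above.

```python
from collections import Counter

FACTORS = (
    "type of performance unit",
    "extent of action description",
    "relevancy of student action",
    "completeness of structural components",
    "precision of each structural component",
)


def classification_code(levels):
    """Join one single-digit level per factor into a five-digit code."""
    if len(levels) != len(FACTORS):
        raise ValueError("expected one level per factor")
    return "".join(str(level) for level in levels)


def modal_agreement(placements):
    """Per-factor share of judges who chose the most common level."""
    n_judges = len(placements)
    shares = []
    for digits in zip(*placements):
        most_common_count = Counter(digits).most_common(1)[0][1]
        shares.append(most_common_count / n_judges)
    return shares


# Hypothetical placements of one objective by three judges.
judges = [(1, 2, 1, 3, 2), (1, 2, 2, 3, 2), (1, 3, 1, 3, 2)]
codes = [classification_code(j) for j in judges]
agreement = modal_agreement(judges)
```

In this made-up case the judges agree fully on the first, fourth, and fifth factors and two of three agree on the others, which would direct a review panel to the second and third digits of the code.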

Developing  Test  Materials  and  Item  Sampling 

Hively and his associates (1968, 1973) have provided a useful scheme for
writing items which are congruent with a criterion. Hively's efforts have
been  primarily  in  the  area  of  domain-referenced  achievement  testing.  In  this 
system,  an  item  form  constitutes  a  complete  set  of  rules  for  generating  a 
domain  of  test  items  which  are  accurate  measures  of  an  objective. 
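
A minimal sketch of an item form, in the sense of a fixed syntactic frame with variable elements and their replacement sets, is given below. The addition template, the replacement sets, and the function name generate_items are hypothetical and chosen only to keep the domain small; they are not taken from Hively's materials.

```python
from itertools import product

# Hypothetical item form: fixed syntax with two variable elements.
TEMPLATE = "{a} + {b} = ___"
REPLACEMENT_SETS = {
    "a": range(1, 6),
    "b": range(1, 6),
}


def generate_items(template, replacement_sets):
    """Yield every item sentence defined by the form."""
    names = list(replacement_sets)
    for values in product(*(replacement_sets[name] for name in names)):
        yield template.format(**dict(zip(names, values)))


items = list(generate_items(TEMPLATE, REPLACEMENT_SETS))
print(len(items))  # 25
```

Because the whole domain is enumerated by rule, a test can be drawn from it at random, and judges can check any single item against the form instead of against a loosely worded objective.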

Popham  (1970)  has  pointed  out  that  the  item  form  approach  has  met  with 
success  in  content  areas  having  well-defined  limits.  In  such  areas  (e.g., 
mathematics), independent judges tend to agree on whether or not a given item
is congruent with the highly-specific behavior domain referenced by the item
form. As less well-defined fields are considered, however, it becomes more
difficult to prepare item forms so that they yield test items which are judged
congruent with a given instructional objective. Popham (1970) has remarked:
"Perhaps the best approach to developing adequate criterion-referenced test
items will be to sharpen our skill in developing item forms which are parsimonious
but also permit the production of high congruency test items."

Cronbach (1963, 1972) presents a generalizability theoretic approach
to achievement testing. Cronbach's theory involves a mathematical model
in the framework of which an achievement test is assumed to be a sample
from a large, well-defined domain of items. Parallel test forms are obtained
by repeated sampling. Analysis of variance techniques (particularly intra-class
correlation) are used to obtain estimates of variance components due to
sampling error, testing conditions, and other sources which may affect the
reliability of scores.
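
For the simplest case, a table of persons by items with one score per cell, the variance components can be estimated from the ordinary two-way analysis of variance mean squares. The sketch below assumes that crossed design and nothing more elaborate (no separate facet for testing conditions); the function names are chosen for this example.

```python
def variance_components(scores):
    """Estimate variance components for a persons x items score table.

    scores is a list of rows (one per person), each a list of item scores.
    Returns (var_person, var_item, var_residual), using the expected mean
    squares of a two-way ANOVA with one observation per cell.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    var_p = max((ms_p - ms_res) / n_i, 0.0)   # negative estimates set to 0
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    return var_p, var_i, ms_res


def generalizability(var_p, var_res, n_items):
    """Intra-class coefficient for the mean over n_items sampled items."""
    return var_p / (var_p + var_res / n_items)
```

The coefficient rises as more items are sampled, since the residual term is divided by the number of items; this is the sense in which repeated sampling from the domain yields parallel forms of known dependability.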

Generalizability theory has been extended (Osburn, 1968) by including
concepts of task analysis to allow sorting subject matter into behavioral
classes. Osburn (1968) has termed this "Universe-defined achievement testing."
Hively (1968, 1973) has used these techniques in an exploration of a mathematics
curriculum. Mathematics represents a subject domain particularly suited to
this approach and Hively reported success as evidenced by high intra-class
correlations between sets of items sampled from a universe. The technique
appears to have diagnostic utility, and is also relevant for examining relationships
between knowledges and skills.

Rogers  (1965)  has  stated  that  "The  major  problem  in  developing  the  test 
item  is  to  clearly  communicate  the  question  or  problem  to  the  student".  He 
suggested  11  practical  guidelines  to  help  surmount  this  problem,  many  of 
which  entail  logistical  considerations  in  the  presentation  of  performance 
test  items  to  examinees. 

Swezey and Pearlstein (1974) suggested that items be developed in conjunction
with a carefully-defined test plan (see "Issues Related to CRT Construction"
section). They also offered the following suggestions for the development
of CRT items:

1. "Make the test items include the same conditions and standards (no
more, no less) as those specified in the objective."

2. "Use graphs, drawings, and photographs when necessary for clear
communication."

3. "Present the test so it does not give the student hints as to the
correct answer, but never make it extremely difficult simply to
ensure a certain number of failures."

4. "Include necessary specific instructions with items".

They  noted  that  items  should  be  assessed  for  adequacy  prior  to  submission  for 
item analysis try-outs. Such assessments include making sure the items match
objectives  as  to  performance,  conditions,  and  standards;  the  items  are  clear, 
unambiguous,  and  reasonably  easy  to  administer;  and  that  they  are  at  the 
appropriate  fidelity  level,  as  determined  previously. 

Quality  Assurance 

According to Hanson and Berger (1971), quality assurance is viewed as
a means for maintaining desired performance levels during the operation of
a large-scale instructional program. Six major components in a quality
assurance program are identified:

1.  Specification  of  indicator  variables.  These  are  variables  which 
measure  important  aspects  of  a  program  and  are  individually  defined 
for each instructional system. Examples are:

a.  Pacing  -  measure  of  instructional  time 

b.  Performance  -  interim  measures  of  learning,  e.g.,  unit  tests, 
module  tests,  etc. 

c. Logistics - indicator reports of failure to deliver materials,
and other implementation difficulties resulting from poorly
planned logistics.

2. Definition of decision rules. The emphasis here should be on
indicators which signal major program failures. Critical levels may
be determined on the basis of evidence from developmental work or
from an analysis of program needs.

3.  Sampling  procedures.  Questions  about  sampling  procedures  must  be 
answered  on  the  basis  of  an  analysis  of  the  severity  of  effects 
resulting  from  insufficient  information.  Factors  to  be  considered 
include: 

a.  Number  of  program  participants  to  provide  data 

b.  How  to  allocate  sampling  units 

c.  Amount  of  information  from  each  participant 

4. Collecting quality assurance data. Special problems concern the
willingness of participants to cooperate in the data gathering effort.
Data must be timely and complete. Hanson and Berger suggest a
number of ways to reduce data collection problems:

a. Minimize the burden on each participant by collecting only
required data.

b. Use thoroughly designed forms and simplified collection
procedures.

c.  Include  indicators  which  can  be  gathered  routinely  and  without 
special  effort. 

5. Analysis and summarization of data. Some data may be analyzed as
they come in; other data may have to be compiled for later analysis.
The exact technique will depend on the type of decision the data
must support.

6. Specification of actions to be taken. This step describes actions
to be taken in the event of a major program failure. Alternatives
should be generated and scaled according to severity of failure.
Information on actions taken to correct program failures should
be fed back into the program development cycle. Such feedback is an
important source of guidance for program revision.

Hanson  and  Berger  offer  an  illustrative  example  of  how  this  process 
might be implemented. They note that quality assurance, as applied to criterion-
referenced programs, acts to ensure that specified performance levels are
maintained throughout the life of a program. If internal quality assurance
programs  of  this  sort  are  built  into  instruction,  then  the  probability  of  an 
instructional  program  becoming  "derailed"  while  functioning  is  minimized. 
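
The link between indicator variables, decision rules, and actions scaled by severity can be sketched as a small lookup. Every indicator name, critical level, and action below is hypothetical and serves only to illustrate the structure; none is taken from Hanson and Berger.

```python
# Hypothetical indicators, critical levels, and actions scaled by severity.
DECISION_RULES = {
    # indicator: list of (critical level, action), most severe first
    "unit_test_pass_rate": [(0.50, "halt and revise unit"),
                            (0.70, "schedule remedial session")],
    "on_time_delivery_rate": [(0.60, "escalate to logistics office"),
                              (0.85, "notify site coordinator")],
}


def required_actions(observed):
    """Return the action triggered for each indicator below a critical level."""
    actions = {}
    for indicator, value in observed.items():
        for critical_level, action in DECISION_RULES.get(indicator, []):
            if value < critical_level:
                actions[indicator] = action
                break  # rules are ordered most severe first
    return actions
```

A record of which actions were triggered, and when, is the kind of information that can be fed back into the program development cycle.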

Designing for Evaluation and Diagnosis

Baker (1972) feels that the critical factor in instruction is not how
test results are portrayed (NRT or CRT) but how they are obtained and what
they represent. Baker suggests the term construct-referenced to describe
achievement tests consisting of a wide variety of item types and well-sampled
content ranges. These tests are generally of the norm-referenced type.
Criterion-referenced tests, Baker feels, are probably better termed domain-
referenced tests (see discussion of Hively et al., 1968, 1973). A domain
specifies  both  the  performance  a  learner  is  to  demonstrate,  and  the  content 
domain  to  which  the  performance  is  to  generalize. 

Baker uses the term objective-referenced test to refer to another subset
of CRTs. Objective-referenced tests start with objectives based upon observable
behaviors from which it is possible to produce homogeneous items
relating to the objective. Baker feels the notion of domain-referenced tests
is more useful than the notion of objective-referenced tests.

Each  type  of  test  provides  different  information  to  guide  improvement 
on  instructional  systems.  Construct-referenced  tests  provide  information 
regarding  the  full  range  of  contents  and  behaviors  relevant  to  a  particular 
construct.  Objective-referenced  tests  provide  items  which  exhibit  similar 
response requirements relating to vaguely defined content areas. Domain-
referenced  tests  include  items  which  conform  to  a  particular  response  segment, 
as  well  as  to  a  class  of  content  to  which  the  performance  is  presumed  to 
generalize. 

According  to  Baker  (1972),  a  test  should  ideally  be  capable  of  yielding 
information  needed  to  implement  an  instructional  improvement  cycle.  An  ideal 
test  should  yield  data  on: 

1.  Applicable  student  abilities 

2.  Deficiencies  in  student  achievement 

3.  Possible  explanations  for  deficiencies 

4.  Alternative  remedial  sequences 

5. Facility with which remedial sequences can be implemented.

All three types of tests provide useful data concerning student abilities.
Construct-referenced tests are probably the most readily available, but are
often not administerable on a cycle compatible with diagnosis, and are usually
reported in a nomothetic manner. A well-designed objective-referenced test
may be scheduled in a more useful fashion. Domain-referenced tests provide
enabling information to allow instructors to identify areas in which students
are competent. Identification of performance deficiencies is theoretically
possible with all three sets of data. However, since cut-offs are often
arbitrary, none of the three tests will necessarily provide adequate information
of this type. In addition, incentives are lacking since most accountability
programs are used to punish deficiency rather than to promote efficiency.

Of  the  three  test  types,  the  domain-referenced  tests  give  program 
developers the most assistance, for they provide clear information about
appropriate types of practice items in the areas of content and performance
measured by the test. However, Baker points out that domain-referenced items
are hard to prepare, mainly because many content areas are not analyzed in
a  fashion  to  allow  precise  specification  of  the  behaviors  in  the  domain. 

Establishing  Cut-Off  Scores 

Prager et al. (1972) discuss cut-off points in mastery testing, and suggest
that two general approaches exist. The first involves setting an arbitrary
overall mastery level. A trainee either attains this criterion level or not.
The second procedure requires that trainees attain the same mastery level in
a given objective, but allows the levels to vary from objective to objective,
depending upon difficulty of material, importance of the objective for later
successful performance, etc. This second method seems more reflective of
reality but as Prager et al. (1972) point out, it is also more difficult to
implement, and to justify specific levels that have been decided upon. Prager
et al. believe that for handicapped children at least, it is appropriate to
set mastery levels for each child relative to his potential. Nitko (1971)
concurs and suggests different cut-offs seems doubtful.
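
The two approaches described by Prager et al. can be contrasted in a few lines. The sketch below takes one reading of the first approach, a single cut-off applied to the mean proportion correct across objectives, which is an assumption; the objective names, scores, and cut-off values are hypothetical.

```python
def overall_mastery(scores, cutoff):
    """First approach: one cut-off applied to the mean proportion correct."""
    return sum(scores.values()) / len(scores) >= cutoff


def per_objective_mastery(scores, cutoffs):
    """Second approach: each objective is judged against its own cut-off."""
    return {obj: scores[obj] >= cutoffs[obj] for obj in cutoffs}


# Hypothetical proportion-correct scores and cut-offs for one trainee.
scores = {"assemble": 0.95, "zero_sights": 0.70, "clear_stoppage": 0.90}
cutoffs = {"assemble": 0.80, "zero_sights": 0.90, "clear_stoppage": 0.85}

passed_overall = overall_mastery(scores, cutoff=0.80)
passed_by_objective = per_objective_mastery(scores, cutoffs)
```

With these made-up figures the trainee passes under the single overall level but fails one objective under the second approach, which illustrates both why the second method is more informative and why each of its levels has to be justified separately.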

Lyons  (1972)  points  out  that  standards  must  take  into  account  the  varying 
criticality of tasks. The criticality of a task is basically an assessment
of the effect upon an operating system of incorrect performance of that task.
Criticality must be determined during task analysis, and must be incorporated
into  the  training  objective.  Unfortunately,  in  many  cases,  task  criticality 
is  not  an  absolute  judgment,  and  the  selection  of  a  metric  for  criticality 
becomes  arbitrary. 

The  approach  to  reliability  advocated  by  Livingston  (1972)  holds  some 
promise  for  determining  pass-fail  scores.  If  Livingston's  assumptions  are 
accepted, then it becomes possible to obtain increased measurement reliability
by  varying  the  criterion  score.  If  the  criterion  score  is  set  so  that  a 
very  high  (or  low)  proportion  passes,  then  one  obtains  reliable  measurement. 
Unfortunately,  it  is  not  often  possible  to  manipulate  criterion  scores  to  this 
extent.  The  training  system  may  require  a  certain  number  passing,  and  the 
criterion  score  is  frequently  adjusted  to  provide  the  required  number. 
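
The dependence of Livingston's reliability on the criterion score can be illustrated numerically. The sketch below assumes the commonly cited form of Livingston's (1972) coefficient, k^2 = (r * var + (mean - C)^2) / (var + (mean - C)^2), where r is the conventional reliability, var the score variance, mean the score mean, and C the criterion score. This report does not reproduce the formula, so the symbols and the numbers used are assumptions made for illustration.

```python
def livingston_k2(reliability: float, variance: float,
                  mean: float, criterion: float) -> float:
    """Criterion-referenced reliability in the commonly cited Livingston form.

    k^2 = (r * var + (mean - C)^2) / (var + (mean - C)^2)
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    shift = (mean - criterion) ** 2
    denominator = variance + shift
    if denominator == 0:
        raise ValueError("undefined when variance is 0 and criterion equals mean")
    return (reliability * variance + shift) / denominator


# Hypothetical test: conventional reliability 0.60, variance 100, mean 50.
for criterion in (50.0, 60.0, 70.0):
    print(criterion, round(livingston_k2(0.60, 100.0, 50.0, criterion), 3))
```

When the criterion score equals the mean, the coefficient reduces to the conventional reliability; it rises as the criterion moves away from the mean in either direction, which corresponds to a very high or very low proportion passing.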

Graham (1974) compared existing eighth-grade, objective-based mathematics
tests to domain-referenced tests designed to assess achievement on the same
objectives, after both tests were administered to 151 eighth-grade students.
He found that slight changes in item form introduced concomitant skills in
addition to those specified by the objectives. These additional skills
confounded the measurement of primary, objective-specified skills, and that
"confounding increases the number of scores falling in the middle of the
possible range. . ." This, in turn, increases the amount of overlap between
the distributions of scores for masters and non-masters, thereby increasing
both the number of scores "at or near any selected mastery cut-off score"
and the likelihood of misclassification. So, for tests consisting of
heterogeneous items (those in which measurement of several skills may be
confounded), classification of masters and non-masters may be seriously
affected by the cut-off score selected: the score distribution for all
examinees is likely to be rectangular. But for tests composed of homogeneous
items, the score distribution is more likely to be bimodal, with masters and
non-masters clearly separated, and there is much more latitude in the
placement of the cut-off score before classification is affected.
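
The point about latitude in cut-off placement can be made concrete with a small sketch. The two score sets below are invented for illustration (they are not Graham's data), and the helper name `fraction_near_cutoff` is hypothetical. It simply counts the share of examinees whose scores lie within one point of a candidate cut-off, i.e., those whose classification is most sensitive to where the cut-off is placed.

```python
from typing import Sequence


def fraction_near_cutoff(scores: Sequence[int], cutoff: int, band: int = 1) -> float:
    """Share of examinees scoring within `band` points of the cut-off."""
    near = sum(1 for s in scores if abs(s - cutoff) <= band)
    return near / len(scores)


# Hypothetical 10-item test, 22 examinees in each set.
rectangular = [s for s in range(11) for _ in range(2)]   # two examinees at each score 0..10
bimodal = [1] * 3 + [2] * 5 + [3] * 3 + [8] * 3 + [9] * 5 + [10] * 3

for cutoff in (4, 5, 6, 7):
    print(cutoff,
          round(fraction_near_cutoff(rectangular, cutoff), 3),
          round(fraction_near_cutoff(bimodal, cutoff), 3))
```

With the rectangular set, a constant share of examinees sits near every candidate cut-off; with the bimodal set, cut-offs placed anywhere in the gap between the two clusters leave few or no examinees near the boundary.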

Swezey and Pearlstein (1974) concurred with Graham's findings, stating,
"The more complex the skills assessed by the CRT and the more varied the
type of performance or product, the greater is the danger of
misclassification. . ." They also noted that immediate manpower needs and
criticality of objectives must influence placement of the cutting score,
justifying lowering or raising the cut-off level, respectively. Finally, they
stated emphatically that "If a test is measuring more than one objective, and
cut-off scores are necessary, a cut-off level should be established for each
objective." This last suggestion, if implemented, would counteract the type
of confounding that leads to rectangular distributions and consequent
difficulty in setting a cut-off level.

From  this  discussion  it  appears  that  generalizable  rules  for  setting 
cut-off  scores  do  not  exist.  Training  developers  setting  cut-off  scores  must 
consider  abilities  of  the  trainee  population,  the  complexity  of  skills  and 
performances  required  by  the  objectives,  through-put  requirements  of  the 
training  system,  the  minimum  competence  requirements,  as  well  as  a  variety  of 
other  variables,  and  act  accordingly. 


Uses of CRM in Non-Military Education Systems

Prager et al. (1972) describe research on one of the first CRM systems
planned for widespread implementation. This Individual Achievement Monitoring
System (IAMS) was designed for the handicapped. Prager et al. point out that
standardized tests often are useless for handicapped individuals, since they
have little value in directing remediation. Tests built to reflect specific
objectives are much more useful when dealing with such populations. The use
of CRM allows a handicapped child's progress to be related to criterion tasks
and competency levels. CRM is further indicated by the need for individualized
instruction and individualized testing when dealing with individuals who
exhibit a variety of perceptual and motor deficiencies.

As  a  result,  a  CRM-centered  accountability  system  was  devised.  This 
project began with the construction of a bank of objectives and test items
to  mesh  with  the  type  of  diagnostic  individualization  required  for  education 
of  the  mentally  handicapped.  To  meet  these  needs,  the  objectives  were,  of 
necessity,  highly  specified.  The  CRT-guided  instructional  system  was  geared 
to  yield  information  to  support  three  types  of  decisions:  placement,  immediate 
achievement,  and  retention.  Standardized  diagnostic  and  achievement  tests 
were  also  used  to  aid  in  placement  decisions.  It  is  still  too  early  to  comment 
on  the  ultimate  usefulness  of  this  system. 

More recently, Popham (1973) presented data on the use of teacher
performance tests. These tests require a teacher to develop a "mini-lesson"
from an explicit instructional objective. After planning the lesson, the
teacher instructs a small group of students. At the conclusion of the
"mini-lesson", students are given a post-test. Affective information is
obtained by asking students to rate the interest value of the lesson. Popham
suggested three potential applications of the teacher performance test:

1. A focusing mechanism. To provide a mechanism to focus the teachers'
attention on the effects of instruction, not on "gee-whiz" methods;

2. A setting for testing the value of instructional tactics. The
teacher performance test can be used as a "test bed" to evaluate
the differential effectiveness of various instructional techniques.
An important aspect of this application involves a post-lesson
analysis in which the instructional approach is appraised in terms
of its effects on learners.

3. A formative or summative evaluation device. Popham views this
application of teacher performance tests as extremely important,
particularly in the appraisal of in-service and pre-service teacher
education programs.

Popham presented three applications of the teacher performance tests.
The applications were for the most part viewed as effective; however, a number
of problems were revealed that may be symptomatic of performance tests in
general. Popham found that unless skilled supervisors were used to conduct
the mini-lesson, many advantages of the post-lesson analysis were lost.

Popham  also  found  that  visible  dividends  were  gained  by  using  supplemental 
normative  information  to  give  teachers  and  evaluators  additional  information 
regarding  the  adequacy  of  performance. 

In a similar area, Baker (1973) reported using a teacher performance
test  as  a  dependent  measure  in  the  evaluation  of  instructional  techniques. 
Baker  discussed  shortcomings  in  using  CRTs  as  dependent  variables.  These 
shortcomings  are  largely  based  on  the  peculiar  psychometric  properties  of 
CRTs.  However,  Baker  feels  that  CRM  is  valuable  for  research  purposes,  even 
with  the  large  number  of  unanswered  questions  concerning  their  reliability 
and  validity.  Baker  points  out  "...  if  the  tests  have  imperfect  reliability 
coefficients in light of imperfect methodology, the researcher is compelled
to report the data, qualify one's conclusions, and encourage replication."
Baker  also  feels  that  the  use  of  teacher  performance  tests  with  indeterminate 
psychometric  characteristics  is  not  ethically  permissible  for  evaluation  of 
individuals — at  least  for  the  present. 

Millman (1973) described three studies on the psychometric characteristics
of teacher effectiveness performance tests, using materials (mini-lessons)
similar to those of Baker (1973) and Popham (1973). According to Millman,
the most disturbing findings resulting from the studies were "the erratic and
low test-retest reliabilities." Millman discussed several possible reasons
for the discouraging reliability findings, but none of these seemed to
ameliorate their significance, so he concluded that "clearly more definitive
work is needed on teaching performance tests."

In a slightly different area of application, Knipe (1973) summarized the
Grand Forks Learning System in which CRTs played a salient part. The Grand
Forks School District began by creating detailed specifications of performance
objectives in K-12 for most subject areas. These objectives were designed to
form the basis of a comprehensive set of teacher/learner contracts, as one
instructional method. It was found that mathematics was the subject area
most amenable to analysis, and therefore it received the most extensive
treatment. The mathematics test consisted of approximately 120
criterion-keyed items for each grade level 3-9. After extensive tryout, items
were revised on the basis of teacher and student recommendations, as well as
on the basis of a psychometric analysis. Knipe found that teachers regarded
the CRTs as useful in supplementing NRTs and, in addition, found them useful
for placement. Knipe concluded: "The criterion-referenced test is the only
type of test that a school district can use to determine if it is working
toward its curriculum goals."

Klein and Kosecoff (1973) summarized present efforts in non-military
CRM, emphasizing CRTs for mathematics. They described nine different CRTs,
analyzing the characteristics of each on five continua: program focus,
instructional dependence, objective and item generation, test models and
packaging, and test scores. The following CRTs and CRM programs were
described: (1) "Prescriptive Mathematics Inventory", used to assess
achievement on objectives associated with fourth-through-eighth grade
mathematics curricula; (2) "Comprehensive Achievement Monitoring", a
computer-assisted multipurpose evaluation system; (3) "Individualized
Criterion Referenced Testing", currently available in kit form for assessing
reading and mathematics skills for grades one through eight; (4)
"Instructional Objectives Exchange", providing manuals covering objectives,
sets of CRTs, and test guides; (5) "MINNEMAST Curriculum Project", CRTs
designed to assess the MINNEMAST program, "a coordinated and sequential
mathematics and science curriculum for the elementary school"; (6) "National
Assessment of Educational Progress", CRTs designed to assess student
achievement nationally, and available in forms for ages 9, 13, 17, and adult;
(7) "Southwest Regional Laboratory", CRTs designed for quality assurance
purposes in the development of text-referenced instructional management
systems; (8) "System for Objectives Based Assessment - Reading", CRT items
keyed to a set of performance objectives, and covering K-12 reading; and (9)
"Zweig and Associates", CRTs indexed to prescription teaching alternatives,
and available for K-8 mathematics assessment.

Boyd and Shimberg (1971) developed a "Handbook of Performance Testing",
the majority of which is devoted to a portfolio of over 100 pages presenting
a great variety of criterion-referenced performance tests. These tests,
ranging from woodwork and metal repairs through dental hygiene to cosmetology,
are presented in considerable detail, and are illustrative of creative
approaches to the design of performance test items.

Hambleton (1974) has commented on CRM as the method of choice in
evaluating individualized instruction programs. He has considered several such
programs thoroughly, and has recommended three types of CR testing as
appropriate: unit pretesting, unit posttesting, and curriculum-embedded
testing. Curriculum-embedded testing is the least important of the three,
since decisions made on the basis of such tests affect the student for only a
short period of time and there exists an additional check of mastery on the
posttest. Unit pretests and posttests are of concern for assigning students to
instructional units and for assessing mastery. Hambleton considers false
positive errors on such tests more critical than false negative errors.
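
One way to act on the view that false positives are more costly than false negatives is to weight the two error types when choosing a cut-off. The sketch below is an assumption-laden illustration, not a procedure given by Hambleton: the scores of known masters and non-masters, the weights, and the function names are all hypothetical.

```python
from typing import Iterable, Sequence


def misclassification_cost(masters: Sequence[int], non_masters: Sequence[int],
                           cutoff: int, fp_weight: float, fn_weight: float) -> float:
    """Weighted count of errors when scores >= cutoff are classed as mastery."""
    false_pos = sum(1 for s in non_masters if s >= cutoff)  # non-masters passed
    false_neg = sum(1 for s in masters if s < cutoff)       # masters failed
    return fp_weight * false_pos + fn_weight * false_neg


def best_cutoff(masters: Sequence[int], non_masters: Sequence[int],
                candidates: Iterable[int], fp_weight: float, fn_weight: float) -> int:
    """Lowest candidate cut-off with the smallest weighted error cost."""
    return min(candidates,
               key=lambda c: misclassification_cost(masters, non_masters, c,
                                                    fp_weight, fn_weight))


# Hypothetical scores on a 10-item unit posttest.
masters = [6, 7, 8, 8, 9, 10]
non_masters = [3, 4, 5, 5, 6, 8]

print(best_cutoff(masters, non_masters, range(0, 12), 1.0, 1.0))  # errors weighted equally
print(best_cutoff(masters, non_masters, range(0, 12), 2.0, 1.0))  # false positives count double
```

With these made-up data, doubling the weight on false positives moves the selected cut-off upward by one point, at the price of failing one additional master.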

Sherman, Zieky and Fremer (1974) have reviewed the process of developing
CRTs in the language area. Guidelines for task analysis are also presented.
The work is a prodigious volume which discusses many aspects of CRT
development in general terms; however, the areas of fidelity and practical
constraints surrounding performance item development are ignored.

Military  Uses 

Extensive  experience  with  use  of  CRM  was  reported  by  Taylor,  Michaels, 
and  Brennan  (1973)  in  connection  with  the  Experimental  Volunteer  Army  Training 
Program (EVATP). To standardize EVATP instruction, reviews, and testing,
performance tests covering a wide variety of content were developed and
distributed to instructors. The tests were revised as experience accumulated;
some  tests  were  revised  as  many  as  three  times.  Drill  sergeants  used  the 
tests  for  review  or  remediation,  while  testing  personnel  used  them  for 
general  subjects,  comprehensive  performance,  and  MOS  tests.  The  tests  also 
provided  the  basis  for  the  EVATP  Quality  Control  System,  which  was  intended 
to  check  on  skill  acquisition  and  maintenance  during  training. 

Unfortunately,  problems  were  encountered  with  the  change  in  role  required 
of  the  instructors  and  drill  sergeants  under  the  system  of  skill  performance 
instruction and training. Considerable effort was required to bring about
the desired changes in instructor role. The CRT-based quality control system
performed its function well by giving an early indication of problems in the
new instructional system. Evaluation of the performance-based system revealed
clear-cut  superiority  over  the  conventional  instructional  system.  The  problems 
with  institutional  change  encountered  by  these  workers  should  be  noted  by 
anyone  proposing  drastic  innovation  where  a  traditional  instructional  system 
is  well-established. 

Pieper, Catrow, Swezey, and Smith (1973) presented a description of a
performance test devised to evaluate the effectiveness of an experimental
training course. The course was individualized, featuring an automated
apprenticeship instructional approach. Test item development for the course
performance test was based on an extensive task analysis. The task analysis
included gathering many photographs of job incumbents performing various tasks.
These photos served as stimulus materials for the tests and were accompanied
by questions requiring "What would I do" responses, or identification of
correct vs. incorrect task performance. All items were developed for
audio-visual presentation, permitting a high degree of control over testing
conditions. Items were selected which discriminated along several criteria.
Internal consistency reliability was also obtained.

A somewhat similar development project, entitled Learner-Centered
Instruction (LCI) (Pieper and Swezey, 1972), also describes a CRT
development process. Here, a major effort was devoted to using alternate form
CRTs, not only for training evaluation, but also for a field follow-up
performance evaluation after trainees had been working in field assignments
for six months.

Air Force Pamphlet 50-58, the Handbook for Designers of Instructional
Systems,  is  a  five-volume  document  which  includes  a  volume  dealing  with 
objectives  and  CRM.  A  job  performance  orientation  to  CRM  is  advocated. 
Specific  guidelines  for  task  analysis  and  for  translating  criterion  objectives 
into  test  items  are  presented  for  both  "hands-on  performance"  and  written 
contexts.  The  document  is  a  good  guide  to  basic  "do's"  and  "don'ts"  in  CRT 
construction. 

A  similar  Army  document,  TRADOC  Reg  350-100-1  presents  guidelines  for 
developing  evaluation  materials,  and  for  quality  control  of  training.  The 
term  is  used  interchangeably  with  "performance  tests"  and  with  "achievement 
tests"  in  this  document.  The  areas  of  CRM,  in  particular,  and  of  Evaluation, 
in  general,  are  given  minimal  coverage.  CON  Pam  350-11  is  essentially  a 
revision  of  TRADOC  Reg  350-100-1,  designed  to  be  compatible  with  unit  training 
requirements.  This  document,  although  briefly  mentioning  testing  and  quality 
control, presents virtually no discussion of CRM.

Various Army schools have developed manuals and guides for their own
use in the area of systems engineering of training. The Army Infantry School
at Fort Benning, Georgia, for example, has published a series of Training
Management Digests as well as a Training Handbook and an Instructor's Handbook.
There also exist generalized guidelines for developing performance-oriented
test items, in terms of memoranda to MOS test item writers and via the
contents of the TEC II program (Training Extension Course). The Field
Artillery School at Fort Sill, Oklahoma, provides an Instructional Systems
Development Course pamphlet as well as booklets on Preparation of Written
Achievement Examinations and an Examination Policy and Procedures Guide in
the gunnery department. The Armor School at Fort Knox, Kentucky, publishes
an Operational Policies and Procedures guide to the systems engineering of
training courses. Generally these documents provide a cursory coverage of
CRT development.

The Army-Wide Training Support group of the Air Defense School at Fort
Bliss, Texas, provides an interesting concept in evaluation of correspondence
course development. Although correspondence course examinations are
necessarily paper-and-pencil (albeit criterion-referenced to the extent
possible), many such courses contain an OJT supplement which is evaluated via
a performance test administered by a competent monitor in the field where the
correspondent is working. This is a laudable attempt to move toward
performance testing in correspondence course evaluation. A supplement to
TRADOC Reg 350-100-1 on developing evaluation instruments has also been
prepared. This guide provides examples of development of evaluation
instruments in radar checkout and maintenance and in leadership areas.

A course entitled "Objectives for Instructional Programs" (Insgroup,
1972), which is used at a number of Army installations, has provided a
diagrammatic guide to the development of instructional programs. CRM is not
covered specifically in this document, nor is it addressed in the recent
Army "state-of-the-art" report on instructional technology, Branson et al.
(1973). However, a CISTRAIN (Coordinated Instructional Systems Training)
course (Deterline and Lenn, 1972a, b), which is also used at Army
installations for training instructional systems developers, does deal with
CRT development and, in fact, provides instructions for writing items and for
developing CRTs. The study guide (1972b) deals with topics such as
developing criteria, identifying objectives, selecting objectives via task
analysis, developing baseline CRT items, revising first draft items, and
preparing feedback. This document provides a good discussion of CRT
development in an overview fashion.

Swezey and Pearlstein's (1974) document, Developing Criterion-Referenced
Tests, was prepared under contract to the Army Research Institute for the
Behavioral and Social Sciences, and provides comprehensive descriptions of
a process for the development, validation, and use of CRTs in military
applications. The manual covers distinctions between CRM and NRM, applications
of CRTs, [...]ling adequacy of objectives, development of thorough test plans,
construction of item pools, selection of "best" items by item analysis and
item review procedures, administration and scoring of CRTs, and assessment
of CRT reliability and validity. The procedures for CRT development
presented therein were derived from a comprehensive review of CRM literature.

U.S. Army FM 21-6 has recently undergone a comprehensive revision to
suit the needs of field trainers. The revised manual is generally in tune
with contemporary training emphasis, with considerable information on
individualized training and team training. In particular, the extensive
guidance provided on generation of objectives should prove very useful to
field trainers. While the revised FM 21-6 does not specifically refer to CRM,
the obvious emphasis on NRT, which distinguished its earlier version, is gone.
A possible weakness in the revised version is the tacit assumption that all
trainees will reach the specified standard of performance. Although the
requirement that all trainees reach criterion is not by itself unreasonable,
practical constraints of time and cost sometimes dictate modified standards
(e.g., 80% reaching criterion), just as Board actions or career reassignment
may also affect the percentage of trainees reaching criterion. Where it is not
feasible to wash out or to recycle trainees, then remediation must be designed
to permit an economical solution. FM 21-6 does not seem to address the
remediation problem. In general, though, FM 21-6 is a good working guide to
field training. It will be interesting to see how effective it is in the hands
of typical field training personnel.

The use of CRTs in military operations has been slowed by the high initial
cost of developing criterion-referenced performance tests. Often the use of
CRTs for performance assessment has required operational equipment or
interactive simulators, drastically raising costs. A solution to the cost
problem may be found in the work of Osborn (1970), who has devised an approach
to "synthetic performance tests" which may lead to lowered testing costs,
although little concrete evidence has appeared in the literature to date.

From  these  limited  examples,  it  appears  that  the  civilian  sector  has  led 
in  the  research  of  methodological  and  theoretical  questions  concerning  the  use 
of  CRM.  However,  the  military  has  clearly  led  in  the  development  and  practical 
application of CRM.

Indirect Approach To CRM

Fremer  (1972)  suggested  that  it  is  meaningful  to  relate  performance  on 
Survey  Achievement  tests  to  significant  real-life  criteria,  such  as  minimal 
competency,  in  a  basic  skills  area.  Fremer  discussed  various  ways  of 
relating  survey  test  scores  to  criterion  performance.  All  of  these  approaches 
are  aimed  at  criterion-referenced  interpretation  of  test  scores.  Fremer 
proposed  that  direct  criterion-referenced  inferences  about  an  examinee's 
abilities  need  not  be  restricted  to  tests  that  are  composed  of  actual  samples 
of  the  behavior  of  interest.  He  suggested  that  considerable  use  can  be  made 
of  the  relationships  observed  among  apparently  diverse  tasks  within  global 
content areas. Fremer argued further that tasks which are not samples of
an objective may provide an adequate basis for generalization to that
objective. He noted that, given a nearly infinite population of objectives,
the use of a survey instrument as a basis for making criterion-referenced
inferences would allow increased efficiency.

An example was presented, using a survey reading test to make inferences
about ability to read a newspaper editorial. A CRT of ability to read
editorials might consist of items quite different from the behavior of
interest. Fremer offered the illustrative example of using vocabulary test
scores to define objective-referenced statements of ability to read
editorials. He noted, however, that the usefulness of interpretive tables,
i.e., those that provide statements referencing criterion behaviors to a range
of test scores, depends heavily upon the method used to establish the
relationship between the survey test scores and the objective-referenced
ability. An essential aspect would be the use of a large and broad enough
sample of criterion performance to permit generalization to the broader range
of performances.

Fremer's example provided for the definition of several levels of
mastery and pointed out that an absolute dichotomy, mastery versus
non-mastery, will seldom be meaningful. It is difficult to understand why
Fremer made this statement, as the basic use of CRT is to decide whether
an individual possesses sufficient ability to be released into the field
or requires further instruction. Many levels of performance can be
identified, but are ultimately reduced to pass-fail or to mastery/non-mastery.
Fremer apparently based his objection on measurement error which can render
classification uncertain. However, as discussed earlier, proper choice of
cut-off and careful attention to development should minimize classification
errors. Further, classification according to levels in addition to
mastery/non-mastery would only increase the probability of classification
errors. Fremer also proposed that the notion of minimal competency should
encompass a variety of behaviors of varying importance, and that the metric of
importance will vary with the goals of the educational system.

Fremer (1972) also set forth a method for relating survey test
performance to minimal competency standards that involves a review of the
proportion of students who are rated as failures at some point in the
curriculum. This serves as a rough estimate of the proportion of students
failing to achieve minimal competency. It is then possible to apply this
proportion to the score distribution for the appropriate test in a survey
achievement test, clearly a normative approach.

A second approach to referencing survey achievement tests to a criterion
of minimal competency is to acquire instructor judgment about the extent to
which individual items could be answered by students performing at a
minimal level. By summing across items, it is possible to obtain an estimate
of the expected minimum score. Fremer, however, recognized the limitations
of this latter process with its high reliance on informed judgment. A third
method proposed by Fremer seeks to define minimal competency in terms of
student behaviors. The outcome of this method is the identification of
bands of test scores associated with minimal competency. As in the second
method, processes involved in this method also rely on informed judgment.
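
The first two of these methods reduce to simple arithmetic, sketched below. The scores, failure rate, and item judgments are hypothetical, as are the function names. Representing an instructor's judgment of "the extent to which" an item could be answered as a probability between 0 and 1 is an assumption of this sketch, not something stated in the description of Fremer's method.

```python
from typing import Sequence


def cutoff_from_failure_rate(scores: Sequence[float], failure_rate: float) -> float:
    """First method: apply the observed proportion of failures to the survey
    test score distribution. Returns the lowest score that would still pass
    if that proportion of the lowest scorers were treated as failing.
    Tied scores at the boundary are not handled specially in this sketch."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must lie between 0 and 1")
    ordered = sorted(scores)
    n_fail = min(int(failure_rate * len(ordered)), len(ordered) - 1)
    return ordered[n_fail]


def expected_minimum_score(item_judgments: Sequence[float]) -> float:
    """Second method: sum, across items, the judged chance that a minimally
    competent student answers each item correctly."""
    if any(not 0.0 <= p <= 1.0 for p in item_judgments):
        raise ValueError("each judgment must lie between 0 and 1")
    return sum(item_judgments)


# Hypothetical survey test scores and a 20 percent failure rate.
survey_scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
print(cutoff_from_failure_rate(survey_scores, 0.20))

# Hypothetical instructor judgments for a five-item test.
judgments = [0.9, 0.8, 0.5, 0.7, 0.6]
print(round(expected_minimum_score(judgments), 2))
```

The first function is normative, since its result depends entirely on the score distribution of the group tested; the second depends only on the judgments, which is where its reliance on informed judgment enters.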

Still  another  method  proposed  by  Fremer  involves  developing  new  tests 
with  a  very  narrow  focus,  i.e.,  a  smaller  area  of  content  and  a  restricted 
range of difficulty. Using this method, it is not necessary to address
every possible objective; however, a test composed of critical items can be
developed by sampling from the item pool. The next step in the process
involves relating achievement at various curriculum placements to the focused
test and the survey instrument. This allows keying of the items on the
survey test to specific critical objectives. A final method put forth by
Fremer is the stand-alone work sample test. This technique is intended
for  use  when  there  is  an  objective  that  is  of  such  interest  that  it  should 
be  measured  directly. 

The procedures enunciated by Fremer are clever in concept, but are
mainly applicable to school systems and traditional curricula, where
well-developed survey instruments exist. Even where appropriate survey
instruments exist, considerable work is involved in keying the survey
instrument. In non-school system instructional environments, dealing with
non-traditional curricula, it is unlikely that appropriate survey instruments
exist.

Gray (1973) developed a written CRT designed to assess performance on
the Piagetian tasks of pendulum oscillation, equilibrium in the balance, and
combinations of colorless and colored chemical bodies. Ninety-six subjects,
12 in each of 8 age groups (9-16 years old), were administered the written
test and the actual Piagetian tasks in a counterbalanced design. Gray's
statistical analysis revealed that, in most cases, "the correspondence between
the predicted and written item sequences is excellent." He concluded that
"the correlations between the two methods measuring the same set of
developmental logic (validity values) along with moderate reliabilities are
encouraging. . . [and] support the conclusion that a written test using the
developmental logic postulated by Piaget as its behavioral criterion is
definitely possible. . ." Although Gray noted that there was considerable
room for improvement in this particular attempt at test development, the
implications for an indirect approach to CRM are obvious.

Using  NRTs  To  Derive  CRM  Data 

Cox and Sterrett (1970) proposed an interesting method for using NRTs
to provide CRM information. The first step in this method is to specify
curriculum objectives and to define student achievement with reference to
these objectives. The second step involves coding each standardized test
item with reference to curriculum objectives. With coded test items and
knowledge of the position of each student in the curriculum, it is then
possible to determine item validity, in the sense that students should
be able to answer correctly items that are coded to objectives that have
already been covered. Step three is scoring the test independently for
each student, taking into account position in the curriculum. The authors
suggested that this model is particularly applicable to group instruction,
since placement in the curriculum can generally be regarded as uniform.
Therefore, it is possible to assign each student a score on items whose
objectives he has covered. It is also possible to obtain information on
objectives which were excluded or not yet covered.
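
The scoring step of this method can be sketched in code. The following is
only an illustration under stated assumptions, not Cox and Sterrett's own
procedure: the function name score_covered_items, the item and objective
labels, and the use of proportion correct as the score are all hypothetical.

```python
from typing import Dict, Optional, Set


def score_covered_items(
    responses: Dict[str, bool],
    item_objectives: Dict[str, str],
    covered: Set[str],
) -> Dict[str, Optional[float]]:
    """Score one student separately on covered and not-yet-covered objectives.

    responses       -- item id -> True if answered correctly (assumed layout)
    item_objectives -- item id -> objective the item was coded to
    covered         -- objectives the student has reached in the curriculum
    """
    def proportion_correct(items):
        # None signals that no items fall in this group.
        if not items:
            return None
        return sum(responses[i] for i in items) / len(items)

    on_covered = [i for i in responses if item_objectives[i] in covered]
    on_other = [i for i in responses if item_objectives[i] not in covered]
    return {
        "covered": proportion_correct(on_covered),
        "not_covered": proportion_correct(on_other),
    }


# Hypothetical data: four items coded to three objectives.
responses = {"i1": True, "i2": False, "i3": True, "i4": True}
item_objectives = {"i1": "A", "i2": "A", "i3": "B", "i4": "C"}
scores = score_covered_items(responses, item_objectives, covered={"A", "B"})
```

The "covered" score is the one that would be interpreted against the
curriculum; the "not_covered" score corresponds to the information on
objectives not yet reached.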

Livingston (1972c) delineated a method for computing criterion-referenced
indices from a set of norm-referenced test scores. First, the norm-referenced
mean ($\mu_X$), variance [$\sigma^2(X)$], and reliability coefficient
[$\rho^2(X,T_X)$] are computed. Then, formulae are used for conversion to
criterion-referenced indices. For example, the criterion-referenced
reliability coefficient [$k^2(X,T_X)$] is found by the following formula:

$$k^2(X,T_X) = \frac{\rho^2(X,T_X)\,\sigma^2(X) + (\mu_X - C_X)^2}{\sigma^2(X) + (\mu_X - C_X)^2}$$

where: $C_X$ = the criterion score

The appropriateness of Livingston's techniques has yet to be empirically
verified, however.
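
As a worked illustration, the conversion can be sketched as a short
function. This assumes the usual form of Livingston's coefficient (the
norm-referenced reliability times the variance, plus the squared distance
between the mean and the criterion score, divided by the variance plus
that same squared distance). The function name livingston_k2 and the
numbers used are hypothetical.

```python
def livingston_k2(mean: float, variance: float,
                  reliability: float, criterion: float) -> float:
    """Criterion-referenced reliability from norm-referenced statistics.

    mean, variance -- norm-referenced mean and variance of the test scores
    reliability    -- norm-referenced reliability coefficient
    criterion      -- the criterion (cut-off) score
    """
    squared_distance = (mean - criterion) ** 2
    denominator = variance + squared_distance
    if denominator == 0:
        raise ValueError("variance and mean-criterion distance are both zero")
    return (reliability * variance + squared_distance) / denominator


# Hypothetical values: mean 70, variance 100, reliability 0.80.
k2_at_mean = livingston_k2(70.0, 100.0, 0.80, criterion=70.0)
k2_off_mean = livingston_k2(70.0, 100.0, 0.80, criterion=80.0)
```

Under this form, the coefficient equals the norm-referenced reliability
when the criterion score coincides with the mean, and rises as the
criterion score moves away from the mean.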

Considerations for a CRT Implementation Model

The development and use of CRM is a fairly recent occurrence in
instructional technology. Partially as a result of this, there is no
comprehensive theory of CRM, such as exists for NRM. Hence, the concepts of
CRT validity and reliability are not yet well developed.

The need for content validity in CRT is well recognized, but there is no
single CRT construction methodology which will serve for all content
domains. Unresolved questions also exist in the area of bandwidth-fidelity,
and the use of reduced fidelity in criterion-referenced performance tests.

The rationale for the use of CRM in evaluating training programs and
describing individual performance is well established. For example, the
instructional systems development model developed by Branson, Hannum, Rayner,
and Johnson (1974), and intended for implementation throughout the Armed
Services, uses CRM as an integral part. Branson et al. noted that "The
process involved in the development of objective-referenced tests is the
development of test exercises that measure student performance of a specific
element identified in the analysis of the learning requirements. . . ," and
that "the test exercises and learning objectives must be in agreement and
must reflect the specific learning elements that were identified in the
learning analysis step [of the instructional systems development model]."

To ensure the best possible results, military or industrial users
should exert every effort to maintain stringent quality control, including:

1. Careful task analysis

   a. Observation of actual job performance when possible
   b. Identification of all skills and knowledges that must be trained
   c. Careful identification of job conditions
   d. Careful identification of job standards
   e. Identification of critical tasks.

2. Careful formulation of objectives

   a. Particular care in the setting of standards
   b. Accurate identification of all objectives
   c. Independent checks on the content of the objectives
   d. Special attention to critical tasks.

3. Item development

   a. Determine if all objectives must be tested
   b. Survey of resources for test
   c. Development of item sampling strategies
   d. Determination of appropriate item format
   e. Development of item pool for objectives to be tested
   f. Development of a tryout plan and criteria for item acceptance
   g. Tryout of items
   h. Revision or rejection of unacceptable items.

4. Consideration of reliability and validity

Particular care must be exercised in setting item acceptance criteria
for item tryout. The use of typical NRT item statistics should be minimized.
Many usual methods are not adequate, e.g., internal consistency estimates.
Traditional stability indexes may also be inappropriate, due to small
numbers of items and reduced variance.

By adhering to strict quality control measures, it should be possible
to obtain measures that have a strong connection with a specified content
domain. Whether they are sensitive to instruction, or will vary greatly
due to measurement error, is unknown. Careful tryout and field follow-up
may currently be the best controls over errors of misclassification due to
poor measurement. The ethical question of the use of measures with unknown
psychometric properties in making decisions about individuals remains to be
addressed.

Cost-Benefit Considerations

Although the costs of training and the costs of test administration can
readily be quantified in dollar terms, we lack an adequate metric to
rigorously assess costs of misclassification. Emrick (1971) proposed a ratio
of regret to quantify relative decision error costs. Emrick's metric,
however, appears rather arbitrary and in need of further elaboration. The
probability of misclassification is the criterion against which an
evaluation technique must be weighed. The results of misclassification range
from system-related effects to interpersonal problems. In some instances
where misclassification results in a system failure, cost can be accurately
measured, and is likely to be high.

A  relative  index  of  cost  can  be  gained  from  task  analysis.  If  the 
analysis  of  the  job  reveals  large  numbers  of  critical  tasks,  or  individual 
tasks  whose  criticality  is  very  high,  then  the  cost  of  supplying  a  non-master 
can  be  assessed  as  high,  and  great  effort  is  justified  in  developing  high 
fidelity  CRTs  in  conjunction  with  a  training  program.  Misclassification 
also  results  in  job  dissatisfaction  and  morale  problems,  evidenced  by  various 
symptoms  of  organizational  illness,  e.g.,  absenteeism,  high  turnover,  poor 
work  group  cohesion,  etc. 

A possible solution to the cost-benefit dilemma may come from work with
symbolic performance tests and the work cited earlier showing that job
knowledge tests can sometimes suffice. The use of symbolic tests and/or job
knowledge tests may result in reduced testing costs in some instances.
However, development of suitable symbolic performance tests may prove to
be difficult. And, as progress is made in lowering CRM development cost,
cost-benefit problems will be largely obviated.

As the question currently stands, there is no doubt that CRM provides
a good basis for evaluation of training and the determination of what a
trainee can actually do. If the system in which the trainee must function
produces a number of critical functions which will render misclassification
expensive, then CRM is a must. And, if the system has been developed from
task analytic data, CRM development is both desirable, for evaluation
purposes, and cost-effective, whether or not there are many critical tasks
involved.

Brief  Summary  of  the  State-of-the-Art 
in  Criterion-Referenced  Testing 

Now,  let  us  set  forth  a  general  position  on  theoretical  and  technical 
aspects  of  CRT  construction  and  use,  based  upon  the  state-of-the-art  of 
CR  testing  as  we  see  it.  Positions  are  presented  sequentially  for  the  following 
topics: 

1.  Design  considerations  and  CRT  use 

2.  Construction  methodology  and  related  issues 

3.  CRT  administration  and  scoring 

4.  Reliability  and  validity 

Among the major considerations in CRT construction is the manner in
which specific uses may affect test design. Test design may vary in several
related respects, such as the basis upon which test items are constructed
and selected. In CR testing, items are generally developed from an analysis
of tasks to be performed and from attempts to operationally define the
behaviors required. This is not necessarily the case in norm-referenced (NR)
testing. The manner in which scores are interpreted and used also
differentiates CRTs from NRTs. In CR testing, scores attained by examinees
are interpreted against an external, absolute standard, as opposed to the
distribution of scores attained by other examinees, which is the case with
NRTs.

It must first be decided whether a CRT, as opposed to an NRT, is appropriate.
CRT scores do not lend themselves to ordering individuals along a continuum.
Thus, if the primary use of test results is to select among individuals for
promotion, special honors, etc., CR testing is contraindicated. Whenever
information is desired for purposes of comparing examinees, NR testing appears
to be more appropriate than CR testing. This applies to tests of achievement,
knowledge, and performance.

CR testing is usually the technique of choice when evaluations are to
be made on the basis of an individual's achievement of specific objectives.
Here the primary question of interest is: "How well can an individual
perform relative to an external standard?", rather than: "How well does
an individual do compared to others?".
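
The two questions can be contrasted with a small sketch that interprets
the same set of scores both ways. The function names, examinee labels,
scores, and the standard of 8 are hypothetical, and percentile rank is
used here only as one familiar norm-referenced interpretation.

```python
from typing import Dict


def criterion_referenced(scores: Dict[str, int], standard: int) -> Dict[str, bool]:
    """Compare each examinee to an external, absolute standard."""
    return {name: score >= standard for name, score in scores.items()}


def norm_referenced(scores: Dict[str, int]) -> Dict[str, float]:
    """Percentile rank: percentage of examinees scoring strictly below."""
    n = len(scores)
    return {
        name: 100.0 * sum(other < score for other in scores.values()) / n
        for name, score in scores.items()
    }


# Hypothetical scores on a 10-point test.
scores = {"A": 9, "B": 7, "C": 5, "D": 8}
met_standard = criterion_referenced(scores, standard=8)
percentile = norm_referenced(scores)
```

The criterion-referenced result for an examinee does not change when other
examinees' scores change; the norm-referenced result does.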

Cost Effectiveness

CRTs may be more expensive to develop and administer than NRTs, in
terms of absolute costs. CRT-specific development costs are due largely
to the need for carefully deriving and specifying objectives, while
additional administration costs may result from the necessity of comparing
examinee performance to external standards. Nevertheless, CR testing may
well be more cost-effective in the long run, if there is a genuine need
to ascertain an individual's ability to perform a specific task.

Indirect approaches to criterion-referencing, by correlating symbolic
performance and/or job knowledge test results with performance measures,
may be an approach to alleviating the high costs of CRTs. Such approaches
involve the development of two tests at different levels of fidelity for
each objective, and subsequent validation of the indirect measures against
the performance measures. Justification for these approaches centers on
savings in administration time and costs.

Development of direct CRTs appears justified, desirable, and
cost-effective, if there is a need to ensure that individuals will be able
to perform adequately on the tasks for which they are being trained. When
there is a need for ensuring minimal, absolute levels of performance, CR
testing is the approach of choice.

Screening and Diagnosis

CRTs are applicable for use as screening devices in cases where there
is a possibility that individuals may be able to perform tasks without
training. If a person can achieve the criterion level on a CRT, he should
be able to enter the job without intervening training. Similarly, CRTs
may be used to determine the appropriate point in a training cycle for an
individual to commence training.

CRTs may also be used as diagnostic aids. Persons achieving the
criterion level might be channeled into advanced instruction, or remediation
might be suggested for those falling below criterion level on certain
objectives. CR testing for diagnostic purposes is likely to be more difficult
and more expensive than CR testing for achievement of objectives, because
detailed documentation on the examinees' behavior is required. This may
necessitate more examiners and/or more elaborate schemes for recording data.

Evaluation of Instructional Programs

Aside from the assessment of individual performance against absolute
standards, CRTs may also be used to evaluate instructional programs. Here,
the primary question of interest is: "Has my instructional program taught
what it is supposed to teach?". NR testing is less appropriate for such
an application than is CR testing, since wide score ranges before and after
administration of the instructional program are not necessarily germane
to the question of interest. CRTs designed for this application are
presumably based directly upon instructional objectives, since the basic
question is whether or not the program has successfully taught performance
compatible with the instructional objectives. CRTs thus provide data having
direct relevance to the question.

Construction Methodology and Related Issues

Due to the relative recency of the CR testing concept, many theoretical
and practical aspects of CRT construction methodology are not so well defined
as is the case for NRTs. Additional sophistication in CRT construction
methodology must await further research on theoretical issues, and results
from more extensive attempts at CRT implementation. Nevertheless, some
general "do's and don'ts" for CRT construction can be extracted from the
methodological literature.

Task  Analysis 

First, CRT construction requires careful analysis of the tasks
comprising the test's subject. While conduct of the task analysis itself
may be outside the test developer's domain, the test developer must obtain
analytic data on: (1) skills and knowledges necessary for task performance,
(2) required performances stated in behavioral terms, (3) criteria associated
with each identified performance, and (4) conditions under which the tasks
must be performed.

Without these data, the test developer cannot adequately define
objectives, and consequently cannot match test items to objectives. Nor can
he ensure the content validity of the test. If usable CRTs are to be
constructed, task analyses are necessary prerequisites.

Preparing  Objectives 

Preparing objectives is one of the first formal steps in constructing
a CRT. Mager (1962) has documented a useful procedure for formatting these
objectives. Mager's suggestions for structuring objectives also appear
appropriate. Information to be used in preparing objectives is best derived
from thorough task analytic data.

If the test developer's input includes a list of unitary objectives
(objectives covering separate, single tasks), the test developer's primary
task is to match test items to these objectives. The test developer must
assume that objectives are properly matched to the actual job tasks. If
this assumption is violated, the resulting CRT will lack content validity.
If, however, the assumption is accurate, and the developer properly matches
items to objectives, content validity will be achieved. Thus, the test
developer must be knowledgeable about appropriate formats and quality
standards for objectives in order to make an adequate assessment of their
suitability for CRT development.

Matching Items to Objectives

Mager (1973) has provided a sound plan for matching CRT items to
objectives. Mager's plan involves matching performances and conditions
stated in, or implied by, objectives with corresponding item performances
and conditions. Mager's plan omits a procedure for matching standards
among objectives and test items, but implies that standards should
also be matched.

The test constructor's task is to create test items that are congruent
with objectives. To the extent that objectives are "fuzzy", the test
constructor cannot create appropriate items. It is recommended that he
send fuzzy objectives back to their originator, annotating their difficulties
and requesting a reconsideration.

When the test developer has received an adequate objective (or set of
objectives) for which a test is to be constructed, a number of factors must
be considered before items are matched to objectives. These factors include:
practical constraints in the testing situation, test fidelity, test format,
and number of items required to test a given objective.

Practical constraints must be systematically assessed before test
items can be constructed, so that the items can be built with performance
indicators which are suitable for such considerations as: testing conditions,
tester availability, time availability, facility and equipment availability,
etc. These considerations obviously impact on test fidelity. CRT items
should be constructed at the highest level of fidelity practicable,
consistent with situational constraints. In cases where critical objectives
are to be tested, special care must be taken to develop sufficiently high
fidelity items so that critical task mastery can be accurately assessed.

Selecting Among Objectives

The tactic of selecting among objectives, that is, randomly testing
a subset of objectives, may be used in some instances, as long as trainees
do not know the subset to be tested. This tactic must not be used when
critical objectives are involved. For objectives of a non-critical nature,
selection may be used to overcome practical constraints imposed by the
testing situation, without necessitating modification of objectives.
Selection among objectives should never be done when it is necessary to
certify that individuals qualify on all objectives.
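
The selection rule described above can be sketched as follows. This is a
minimal illustration under assumptions: the function name
select_objectives, the objective labels, and the representation of
criticality as a boolean flag are hypothetical. Critical objectives are
always tested; only non-critical objectives are sampled at random.

```python
import random
from typing import Dict, List


def select_objectives(
    objectives: Dict[str, bool],
    n_noncritical: int,
    rng: random.Random,
) -> List[str]:
    """Return all critical objectives plus a random subset of the others.

    objectives    -- objective name -> True if the objective is critical
    n_noncritical -- how many non-critical objectives to test
    rng           -- random source; the draw should not be known to trainees
    """
    critical = sorted(name for name, is_crit in objectives.items() if is_crit)
    noncritical = sorted(name for name, is_crit in objectives.items() if not is_crit)
    k = min(n_noncritical, len(noncritical))
    return critical + rng.sample(noncritical, k)


# Hypothetical objective list: two critical, four non-critical.
objectives = {
    "clear_weapon": True,
    "apply_tourniquet": True,
    "fill_out_form": False,
    "clean_equipment": False,
    "inventory_kit": False,
    "stow_gear": False,
}
to_test = select_objectives(objectives, n_noncritical=2, rng=random.Random())
```

When individuals must be certified on all objectives, the sampling step is
skipped by setting n_noncritical to the full number of non-critical
objectives.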

Number of Items

No hard and fast rules exist for specifying the number of items to be
created for a given objective. It is recommended that as many items be
included as test time will permit, within limits suggested by
considerations of motivational and fatigue factors. As Graham (1974) has
noted, "even for highly homogeneous tests, four or five items may be
necessary to minimize classification errors." Thus, even for CRTs measuring
a single, well-specified objective with few confounding factors, additional
items may help to reduce measurement error. For more heterogeneous tests,
the desirability of having extra items may be even more pronounced.
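
The effect of test length on classification errors can be illustrated with
a simple binomial sketch. This model is an assumption introduced here for
illustration and is not taken from Graham (1974): each item is treated as
an independent trial, a "master" answers each item correctly with
probability p_master, and a "non-master" with probability p_nonmaster. The
function names and the probabilities are hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k correct answers out of n independent items."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def misclassification_rates(n_items, cutoff, p_master, p_nonmaster):
    """Return (false_fail, false_pass) under the assumed binomial model.

    false_fail -- probability that a master scores below the cut-off
    false_pass -- probability that a non-master scores at or above it
    """
    false_fail = sum(binomial_pmf(k, n_items, p_master) for k in range(cutoff))
    false_pass = sum(
        binomial_pmf(k, n_items, p_nonmaster) for k in range(cutoff, n_items + 1)
    )
    return false_fail, false_pass


# Hypothetical comparison: one item versus five items (pass = 4 of 5).
one_item = misclassification_rates(1, 1, p_master=0.9, p_nonmaster=0.3)
five_items = misclassification_rates(5, 4, p_master=0.9, p_nonmaster=0.3)
```

Under these assumed probabilities, both error rates are smaller for the
five-item test than for the single item, which is the direction of the
effect described above.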

Format 

Test format may, in many cases, be largely dictated by objectives.
Certain objectives, for example, may require hands-on performance testing.
Such things as number of items to be included, and practical constraints
such as time and manpower availability, may also help determine format; e.g.,
a situational item, multiple-choice format might be the only feasible way
of testing some sets of objectives. A general guideline might be based
on Edgerton's (1974) suggestion that item styles not be mixed in the
same test, so as to avoid measuring "test taking skill" instead of subject
matter competence.

Item generation rules, such as "item forms" and "facets", are not
yet sufficiently researched to warrant use by personnel who are not
sophisticated in psychometrics. Hence, for objectives that may be tested by
an unlimited number of items, such as those dealing with concepts, the
best suggestion that can be offered testing personnel at this time is to
be sure that each item matches the objective it tests.

Item  Pools 

After the test developer has considered such factors as fidelity,
number of items, etc., items can be matched to objectives using principles
similar to those advanced by Mager (1973). The test developer should
construct a pool of items considerably larger than the number required for
the test, so that the best items can be selected. Items are then constructed
at the level of fidelity and in the format previously determined.

Item  Analysis 

Traditional item analysis techniques, like other statistical techniques
developed in conjunction with NR testing, have limited applicability for
CR testing (due to restricted ranges of score variance in CRTs). Although
recent studies have suggested techniques for increasing variance of CRT
scores (e.g., Haladyna, 1973; Woodson, 1973), these techniques are
"experimental", and it is not yet appropriate to apply them as a matter of
course. Consequently, until additional research develops and refines new
approaches to item analysis appropriate for CR testing, a simple index which
relies on the use of "masters" and "non-masters" (e.g., those who are
beginning training and those who have completed training) appears to be an
appropriate technique.

"Masters" and "non-masters" are tested and their patterns of pass
and fail on the items are recorded. Phi coefficients are computed using
four-fold tables ("master"-"non-master", pass-fail) for each item. Good
items are those which are passed by "masters" and failed by "non-masters."
Items are poor if there is little difference on pass-fail patterns between
"masters" and "non-masters", or if more "non-masters" than "masters" pass
them. Low or negative coefficients act as warning flags. Items receiving low
coefficients should either be thrown out or, at least, reconsidered
carefully before inclusion in a CRT. These warning flags are relevant if the
pool of items is homogeneous, or if it is composed of items testing several
objectives.

Care  must  be  exercised  to  ensure  that  all  objectives  are  represented  by 
the  proper  number  of  items,  as  determined  previously.  Item  balance  among 
disparate  objectives  measured  by  the  same  test  should  be  maintained  as  planned. 
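
The item index described above can be sketched as a phi coefficient
computed from a four-fold table of master/non-master by pass/fail counts.
The function names, the example counts, and the flagging threshold of 0.3
are hypothetical; the source does not state a numerical threshold for a
"low" coefficient.

```python
from math import sqrt
from typing import Dict, List, Tuple


def phi_coefficient(master_pass: int, master_fail: int,
                    nonmaster_pass: int, nonmaster_fail: int) -> float:
    """Phi for a 2x2 table; positive when masters pass and non-masters fail."""
    a, b, c, d = master_pass, master_fail, nonmaster_pass, nonmaster_fail
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A row or column is empty (e.g., everyone passed); phi is undefined,
        # so report 0.0 to make the item show up as non-discriminating.
        return 0.0
    return (a * d - b * c) / denominator


def flag_items(tables: Dict[str, Tuple[int, int, int, int]],
               threshold: float = 0.3) -> List[str]:
    """Return items whose phi falls below the (assumed) threshold."""
    return [item for item, counts in tables.items()
            if phi_coefficient(*counts) < threshold]


# Hypothetical tryout counts: (master pass, master fail,
#                              non-master pass, non-master fail)
tables = {
    "item_1": (10, 0, 0, 10),  # passed by masters, failed by non-masters
    "item_2": (5, 5, 5, 5),    # no difference between the groups
    "item_3": (2, 8, 8, 2),    # more non-masters than masters pass
}
flagged = flag_items(tables)
```

Flagged items are candidates for rejection or careful reconsideration, not
automatic deletions; the balance of items across objectives still has to
be maintained as planned.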

CRT  Administration  and  Scoring 

Administration 

Like all tests, CRTs must be administered under standardized conditions.
CRTs should include accompanying documentation which specifies: (1) test
administration conditions; (2) instructions; (3) administration procedures
(including how to handle questions, how to check and set up test supplies
and equipment, etc.); (4) circumstances for excusing examinees from the
test, due to illness, fatigue, etc.; (5) environmental circumstances under
which test administration should be cancelled; and (6) scoring procedures.

Test administrators must be trained to follow specifications precisely.
Since specifications will apply to any test, documentation accompanying a
specific CRT need not necessarily be extremely detailed, except for special
requirements such as setting up the test facility, and test scoring.

Scoring

Test scoring procedures must be developed during the test construction
process, since they will generally vary as a function of the type of CRT.
There are a number of interrelated decisions that must be made concerning
scoring. These include:

1. Objectivity of scoring

2. Process vs product scoring methods

3. Type of scoring (go/no-go, rating scales, etc.)

4. Cut-off points

5. Non-interference vs assist methods

Objectivity

Every attempt should be made to maximize objectivity in scoring CRTs.
In low fidelity tests, such as those using multiple-choice formats,
objectivity is apparent. (Such tests can be computer-scored.) In higher
fidelity CRTs, it is relatively simple to maximize objectivity for
hard-skill subjects; however, soft-skill areas, such as tactics, leadership,
etc., are more difficult to test objectively. To the extent that objectivity
is not achieved, reliability is attenuated. Efforts must be made to specify
soft-skill objectives precisely, so that appropriate items (with associated
objective scoring procedures) can be prepared. Even in the best of
circumstances, however, soft-skill CRTs will probably have less objective
scoring guides than will tests of hard-skill subjects. One way to maximize
objectivity in soft-skill CR testing is to require several raters to assess
each individual. Inter-rater reliability can then be calculated. If low
inter-rater reliability is found consistently, the test should be revised.
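
One simple way to quantify agreement among several raters is sketched
below. The source does not name a particular inter-rater index; mean
pairwise percent agreement on go/no-go judgments is used here purely as an
assumption, and the function name and ratings are hypothetical.

```python
from itertools import combinations
from typing import List


def mean_pairwise_agreement(ratings: List[List[bool]]) -> float:
    """Average, over all rater pairs, of the proportion of matching judgments.

    ratings[r][e] is rater r's go/no-go judgment for examinee e.
    """
    if len(ratings) < 2:
        raise ValueError("at least two raters are required")
    n_examinees = len(ratings[0])
    if n_examinees == 0 or any(len(r) != n_examinees for r in ratings):
        raise ValueError("every rater must judge the same non-empty group")
    pairs = list(combinations(range(len(ratings)), 2))
    total = 0.0
    for i, j in pairs:
        matches = sum(a == b for a, b in zip(ratings[i], ratings[j]))
        total += matches / n_examinees
    return total / len(pairs)


# Hypothetical judgments: three raters, four examinees.
ratings = [
    [True, True, False, False],
    [True, True, False, True],
    [True, False, False, False],
]
agreement = mean_pairwise_agreement(ratings)
```

Percent agreement does not correct for chance agreement, so a consistently
low value is a clearer signal for revising the test than a high value is
evidence of adequate reliability.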

Process-Product

i.C. Smith's (1965) guidelines for determining process versus product
measurement appear adequate, with slight modifications. That is, product
measurement is always appropriate if the objective specifies a product.
When a product measure is called for, it should be incorporated into the
objective, and carried over into the test items. Product measures are
called for when:

(a) the product can be measured as to presence or characteristics

(b) the procedure leading to the product can vary without affecting
the product.

Process measurement is indicated when the objective specifies a
required sequence of performances which can be observed, and the
performance is as important as the product. Process measurement is also
appropriate in cases where the product cannot be measured for safety or
other constraining reasons.

There may also be situations where both process and product
measurement are appropriate for a given objective. Following are several
examples of conditions that may call for both product and process
measurement:

(a) Although the product is more important than the process(es) which
lead to its completion, there are critical steps which, if
misperformed, may cause damage to equipment or injury to personnel.

(b) The process and product are of similar importance, but it cannot
be assumed that the product will meet criterion levels.

(c) Diagnostic information is needed. (By having process as well as
product measures, information as to why the product does not
meet the criterion can be obtained.)

When both process and product measures are obtained for a specific
objective, scoring must follow the criterion specified by the objective.
That is, if the criterion specifies only a product, then process scores
should not be used to assess achievement of the criterion.

Type of Scoring

The type of scoring system employed must be appropriate for the
objective. If the objective specifies an action or product, a go/no-go
scoring system should be used (either the action occurs in the proper
sequence or it does not; either the product results or it does not). If
the objective specifies characteristics of a criterion-level product or
action, a rating scale or other form of point assignment is indicated.

Point assignments must be made on an explicit, well-defined basis for
each item. For rating scales, inter-rater reliability must be high. Point
assignments must be tied to criterion levels specified in the objective.

Cut-Off Points

Cut-off levels should reflect mastery of the objective to the extent
required. Since factors other than ability to perform a task (such as
careless errors, measurement errors, etc.) may affect an individual's score,
cut-off levels are often set somewhat below 100 percent. If, for example,
an objective calls for multiplication of two four-digit numbers, the
criterion might specify performing 10 such sets within five minutes,
achieving the correct answer in at least eight cases. Thus, the cut-off
score of 8 (below 8 = fail) reflects an arbitrary definition of mastery.
True mastery would require 10 out of 10.
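
The multiplication example above can be sketched as a go/no-go scoring rule. This is only an illustration: the function name, its parameters, and the representation of answers as a list of true/false values are assumptions made for the sketch, not part of the report.

```python
def score_multiplication_test(answers_correct, minutes_used,
                              cutoff=8, time_limit=5.0):
    """Return "go" if the criterion is met, otherwise "no-go".

    answers_correct: list of booleans, one per problem (hypothetical format).
    minutes_used:    time the examinee took to work the full set.
    cutoff:          minimum number of correct answers (8 in the example).
    time_limit:      time allowed in minutes (five in the example).
    """
    n_correct = sum(1 for ok in answers_correct if ok)
    within_time = minutes_used <= time_limit
    return "go" if (n_correct >= cutoff and within_time) else "no-go"


# Eight of ten correct inside the time limit meets the cut-off.
result = score_multiplication_test([True] * 8 + [False] * 2, minutes_used=4.5)
```

Raising `cutoff` to 10 would express the stricter definition of mastery mentioned in the text.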

Graham (1974) has made some valuable suggestions concerning the setting
of cut-off points. The cut-off, basically, should discriminate masters
from non-masters. However, as item domains become more broad, more
heterogeneous item sets are required. Thus, the confounding influence of skills
and knowledges which are not directly related to objectives increases. For
tests measuring objectives having broad domains (or several objectives with
different domains) the overlap between mastery and non-mastery scores
consequently widens.

When little overlap occurs between mastery and non-mastery scores
(as is the case for tests measuring a single objective with a relatively
restricted domain) setting a cut-off score is less critical. The cut-off
point should reflect the standard specified by the objective, and can do
so without falling into the zone of overlap between masters and non-masters,
since this zone, by definition, is either narrow or non-existent. On the
other hand, if the overlap is wide, the point at which the cut-off score
is set is critical. Wherever the cut-off score is set, there will be
some misclassification. In such cases, there are two considerations.

First, objectives must be specified precisely, with item domains as restricted
as possible, in order to narrow the mastery-nonmastery overlap. When
achievement of several objectives of disparate nature is measured by a
single test, separate scores for each objective's item set should be obtained,
each with its own cut-off. However, for end-of-course or end-of-cycle
tests which require high levels of skill and knowledge integration, a single
cut-off may be set, since what is to be evaluated is a cluster of skills
and knowledges applied in combination.

Second, costs of false positives and false negatives must be considered.
If the costs for false negatives are relatively high (e.g., manpower needs
are critical) the cut-off score might justifiably be lowered. If the
costs of false positives are high, then cut-off scores must remain high.

In any case, when performance on critical tasks is tested, cut-off points
must be kept high enough to reflect the standards specified in the objectives
for those tasks.
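
One way to make the cost consideration concrete is to tabulate, for each candidate cut-off, how many known masters would fail (false negatives) and how many known non-masters would pass (false positives), and to weight each kind of error by its cost. The sketch below is an illustration only; the function names, the cost weights, the sample scores, and the optional `floor` argument (standing in for the standard required on critical tasks) are all assumptions, not a procedure given in the report.

```python
def misclassification_cost(cutoff, master_scores, nonmaster_scores,
                           cost_false_negative, cost_false_positive):
    """Weighted count of misclassified examinees at a given cut-off."""
    false_negatives = sum(1 for s in master_scores if s < cutoff)
    false_positives = sum(1 for s in nonmaster_scores if s >= cutoff)
    return (false_negatives * cost_false_negative
            + false_positives * cost_false_positive)


def best_cutoff(candidates, master_scores, nonmaster_scores,
                cost_false_negative, cost_false_positive, floor=None):
    """Pick the candidate cut-off with the lowest weighted cost.

    floor: optional minimum cut-off, e.g. the standard required by the
           objective for a critical task (hypothetical parameter).
    """
    eligible = [c for c in candidates if floor is None or c >= floor]
    return min(eligible,
               key=lambda c: misclassification_cost(
                   c, master_scores, nonmaster_scores,
                   cost_false_negative, cost_false_positive))


# Hypothetical scores on a 10-item test with a wide overlap zone.
masters = [7, 8, 8, 9, 9, 10, 10]
nonmasters = [4, 5, 6, 6, 7, 8]

# Costly false negatives pull the cut-off down; costly false positives push it up.
low = best_cutoff(range(5, 11), masters, nonmasters, 3, 1)
high = best_cutoff(range(5, 11), masters, nonmasters, 1, 3)
```

With these made-up scores, weighting false negatives more heavily selects a lower cut-off than weighting false positives more heavily, and setting `floor` keeps the result from dropping below the required standard.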

Assist  vs  Non-Interference 

In general, a non-interference method of test administration is
preferred over an assist method in CR testing applications. In the assist
method, the examinee is scored no-go for a missed item, corrected, and then
allowed to proceed. A major problem here is that if the criterion requires
an examinee to complete a chain of steps, he should be tested on his
ability to do so. On the job, the examinee will have to complete the
chain of steps correctly, with no help. There are, however, cases in which
an assist scoring technique can be profitably used. These involve uses of
CR testing for diagnosis. In such cases, the trainee is permitted to
complete a chain of steps and given assistance on those which he cannot
perform adequately. He is typically scored no-go for steps where he is
assisted. The record of no-go steps is a useful diagnostic tool: remediation
can concentrate on missed steps. Such records may also be useful for
evaluating instructional material, especially if many examinees have
similar patterns of no-go items.
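
The diagnostic use of no-go records can be sketched as a simple tally across examinees. The record format (a mapping from examinee to the steps scored no-go), the function names, and the one-half threshold are assumptions chosen for illustration only.

```python
from collections import Counter


def tally_no_go_steps(records):
    """Count, for each step, how many examinees were scored no-go on it.

    records: dict mapping an examinee identifier to the list of steps
             on which that examinee was assisted (hypothetical format).
    """
    counts = Counter()
    for steps in records.values():
        counts.update(set(steps))  # count each step once per examinee
    return counts


def steps_needing_review(records, min_fraction=0.5):
    """Steps missed by at least min_fraction of examinees.

    A step missed by many examinees may point to a weakness in the
    instructional material rather than in individual trainees.
    """
    if not records:
        return []
    counts = tally_no_go_steps(records)
    n_examinees = len(records)
    return sorted(step for step, n in counts.items()
                  if n / n_examinees >= min_fraction)


records = {
    "A": ["step 3", "step 5"],
    "B": ["step 3"],
    "C": [],
    "D": ["step 3", "step 7"],
}
flagged = steps_needing_review(records)
```

Each examinee's own list supports individual remediation, while the flagged steps suggest where the instruction itself might be examined.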

Reliability  and  Validity 

Reliability 

Techniques for assessing CRT reliability are, for the most part,
either not fully developed or are based on questionable assumptions. (For
example, see Livingston, 1972; Oakland, 1972; Haladyna, 1974; and Woodson,
1974.) The need for additional work in the area of CRT reliability
continues to be a pressing one.

A practical solution is to assess test-retest reliability of CRTs, a
procedure which does not depend on internal consistency, and which increases
the variability of test results because of the two test administrations
required. The φ (phi) coefficient is useful for analyzing the resulting
four-fold (first administration-second administration, pass-fail) data.
φ values less than +.50 would indicate unacceptable test-retest reliability
for CRTs.
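
A minimal sketch of the test-retest computation follows, assuming the coefficient intended is the standard phi coefficient for a 2 x 2 table. The function name, argument names, and the example counts are hypothetical; only the .50 threshold comes from the text.

```python
import math


def phi_coefficient(pass_pass, pass_fail, fail_pass, fail_fail):
    """Phi coefficient for a four-fold (2 x 2) pass-fail table.

    pass_pass: passed both administrations
    pass_fail: passed the first, failed the second
    fail_pass: failed the first, passed the second
    fail_fail: failed both administrations
    """
    a, b, c, d = pass_pass, pass_fail, fail_pass, fail_fail
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A row or column total is zero (e.g. everyone passed), so phi
        # is undefined for this table.
        raise ValueError("phi is undefined when a marginal total is zero")
    return (a * d - b * c) / denominator


# Hypothetical counts for 100 examinees tested twice.
phi = phi_coefficient(pass_pass=40, pass_fail=5, fail_pass=5, fail_fail=50)
acceptable = phi >= 0.50  # threshold taken from the text
```

The same function applies to any four-fold comparison of a CRT with a second pass-fail measure.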

Validity 

Content validation is an especially appropriate method in CRT
applications. A CRT is content valid if the test items are carefully based on
the performances, conditions, and standards specified in the objectives and
if the test items appropriately sample objectives. (Of course, the
objectives themselves must be sound.) Thus, in most instances, careful test
construction will, itself, enable the development of content valid CRTs.

In instances where low fidelity CRTs are constructed, it may be
more difficult to determine content validity, since the items are not likely
to be precisely matched to objectives. In such cases, there are two
additional types of criterion-related validation that are well-suited to CRTs:
concurrent validity and predictive validity.

In determining concurrent validity, CRT results are compared with an
outside measure of the behaviors tested by the CRT. This outside measure
must be the best available assessment of performance on the objective(s)
in question. The assessment of concurrent validity involves individual
assessment via the CRT and the outside measure close together in time
(concurrently). The φ (phi) coefficient again is used on the four-fold
data (CRT-other measure, pass-fail).

Predictive validity involves the same assumptions. The outside
measure must be an accurate measure of the performance in question, or the
validation will be meaningless. Predictive validity is calculated the
same way, except the outside measure is taken at a later time, i.e., when
the individuals are actually performing the job for which they've been
trained. The φ (phi) estimate is calculated just as for concurrent validity.